Linear Solver - Different behaviours on different OS

Post by kfourteau »

Hello everyone,

I am currently experiencing a strange behaviour of Elmer, and I really do not understand its origin.

In a nutshell: although I am using the same Elmer version (9.0), the same sif file and the same mesh, I am getting very different results on different computers.

My workflow is simply to solve the strain-imposed compression of a piece of porous material, in steady-state and using an isotropic non-linear viscoplastic rheology for the constitutive material. The porous media meshes are produced with the CGAL library and converted to the Elmer format with a personal program.

I'm running the simulations on two different HPCs, one based on Centos 7 on which I did not personally do the installation, and another based on Ubuntu 20 where I manually compiled Elmer (linked against MMG, Hypre, and Mumps). Both machines use Elmer 9.0. After the installation on the Ubuntu machine, 85% of the ctest tests pass.

The linear system is not solved properly on the Ubuntu machine. I've put a dropbox link below to an archive with a test case. When I run the simulation with the mesh mesh_100 (sif file is Compression_100.sif), the pre-conditioner manages to obtain quite low residuals (1e-6), which is below my standard convergence criterion (1e-4). However, when I look at the proposed solution it is clearly faulty (it just corresponds to the initial solution provided in the sif). In contrast, when I run the same case on the Centos HPC, everything looks fine.

From there I've tested a few things on the Ubuntu machine:
- Modifying the initial condition from imposing the average strain everywhere to zero everywhere. The residual after the pre-conditioner remains below the convergence criterion, and Elmer simply outputs the initial condition.
- I lowered the convergence criterion to 1e-10. The residual after the pre-conditioner is then above the criterion, and the linear solver diverges.
- If I use a smaller mesh (mesh_50 in the attached archive), then everything goes well (residuals are not so low just after the pre-conditioner, and the linear solver converges afterwards). Here the Centos and Ubuntu machines behave similarly.
- If I use a large mesh with a simple geometry produced with GMSH (not with CGAL) and then convert it with ElmerGrid, everything occurs nicely.
- I've tested on another Ubuntu 20 machine. I have the same behaviour.
- The problem occurs whether I am running the problem sequentially or in parallel.
- If I increase the "Critical Shear Rate" of the material, things start to look normal on Ubuntu.
- I also realised that simulations on the Ubuntu machine are prone to producing the error "WARNING:: RealBiCGStab(l): kappal^2 is non-positive, iteration halted" during the Linear Solving stage. It is something I seldom encounter with Centos.

I am quite lost, and have no idea why the two machines behave differently. Apparently it could be related to the mesh (as it only occurs with sufficiently large CGAL meshes), to the libraries/OS (as the behaviour is different on Ubuntu and Centos), and/or to the way the effective viscosity is computed in the material law.

Does anyone have any idea about the origin of this problem? Let me know if you require more information (specific versions of the libraries, etc).

Thanks a lot!


Re: Linear Solver - Different behaviours on different OS

Post by kevinarden »

Centos is based on Fedora/Red Hat. Ubuntu is Debian based. There is no readily available binary for Fedora/Red Hat, so the code has to be compiled. Ubuntu has a binary release and a nightly update. Therefore, even though they are both Elmer 9, the Centos one is likely static whereas the Ubuntu version may be updated every day, and there are nearly daily code changes. This means it is possible the two installed codes are not the same.

Re: Linear Solver - Different behaviours on different OS

Post by kfourteau »

Thanks for the response (and sorry for the delay on my side, I wanted to do a few more tests before posting).

The two Elmer installations on Ubuntu and Centos were manually compiled (no packaged binaries involved). I tried installing several versions of Elmer on the Ubuntu machine (sources from Elmer 8.4, from the elmerice branch on GitHub, etc) and the behavior is still the same: if the "Critical Shear Rate" parameter in Glen's law is too low, the computation of the residuals does not make sense on Ubuntu, while it seems fine on Centos. If I increase this parameter, Ubuntu and Centos behave similarly.

Reading the code, I would expect that having a very low Critical Shear Rate simply implies that the linear behavior is never enforced when computing the effective viscosity of the material. I would thus have said that setting a very low value for the Critical Shear Rate should not have any impact on the simulations. But clearly it has one on my Ubuntu simulations.
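
To make this expectation concrete, here is a minimal sketch of a Glen-type effective viscosity with a lower cutoff on the strain rate. It assumes the common textbook form eta = 0.5 * A^(-1/n) * e^((1-n)/n); the function name, the parameter names and the default values are hypothetical, and this is not Elmer's actual routine.

```python
def glen_effective_viscosity(strain_rate, fluidity, n=3.0, critical_shear_rate=1e-10):
    """Glen-type power-law viscosity with a lower strain-rate cutoff.

    strain_rate: effective strain rate (assumed non-negative)
    fluidity: rate factor A in Glen's law (assumed positive)
    critical_shear_rate: hypothetical name for the cutoff below which
        the viscosity stops growing (linear behaviour)
    """
    # Clamp the strain rate from below so the viscosity stays finite
    # when the strain rate goes to zero.
    e = max(strain_rate, critical_shear_rate)
    return 0.5 * fluidity ** (-1.0 / n) * e ** ((1.0 - n) / n)
```

In this sketch the cutoff has no effect wherever the strain rate is above it, which matches the expectation above. Where the strain rate is (near) zero, however, the cutoff alone sets the viscosity, so a lower cutoff gives a much larger viscosity at those points. Whether this plays any role in the behaviour I observe is only a guess.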

For now, I increased this Critical Shear Rate parameter to get rid of the very strange computed residuals. But I'm still wondering what is concretely happening and if it means there's something wrong in some of the installs.


Re: Linear Solver - Different behaviours on different OS

Post by raback »

Hi Kevin,

This is indeed strange behavior.

When this happens I usually try to isolate the problem:
* What is the smallest case size at which the problem occurs?
* What is the 1st timestep where the problem occurs?

Then I raise the "Max Output Level" to at least 20 or so; 32 is the maximum. This includes some additional debugging info. Run exactly the same problem on the two platforms (or two versions etc.) and direct the output to a file.

Then take some advanced diff tool - my favourite is "meld" - and see where the two cases start to diverge. If you share two such files, I can also try to give my best guess as to what is happening.
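
If a graphical tool is not at hand, the same idea can be sketched in a few lines of Python. This is only a rough illustration: the function names and the log file names are made up, and real logs will contain lines such as timings that differ trivially, so a visual diff is usually more comfortable.

```python
from itertools import zip_longest


def first_divergence(lines_a, lines_b):
    """Return (line_number, line_a, line_b) for the first differing line.

    Line numbers start at 1. If one log is shorter, the missing line is
    reported as None. Returns None when the two logs are identical.
    """
    for number, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a != b:
            return number, a, b
    return None


def first_divergence_in_files(path_a, path_b):
    """Compare two log files line by line (paths are placeholders)."""
    with open(path_a) as file_a, open(path_b) as file_b:
        return first_divergence(file_a.read().splitlines(),
                                file_b.read().splitlines())
```

For example, first_divergence_in_files("centos.log", "ubuntu.log") would point to the first line where the two runs differ, which is where to start reading.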
