🐛 Your bug may already be reported!
Please search on the issue tracker before creating a new issue.
🐒 🐾 🐝
Describe the bug
Run a model with a large enough size, on few tasks, on say your laptop
model size = 152024914 (wrf-chem)
ntasks = 4
What was the expected outcome?
Run ok, or just run out of memory
What actually happened?
Wacky window size, -1558518808
Before computing prior observation values TIME: 2026/02/23 13:59:58
size of window 8 size of bitesize 4
size of product 4 size of product 4
MPI_ADDRESS_KIND size: 8
i4 size: 4 i8 size: 8 r8 size: 8
HK window size -1558518808 bytesize 8 my_num_vars 38006229 copies 9
[cisl-sapulpa:00000] *** An error occurred in MPI_Win_create
In win_mod.f90, printing out these values:
call mpi_type_size(datasize, bytesize, ierr)
window_size = my_num_vars*state_ens_handle%num_copies*bytesize
print*, 'size of window', sizeof(window_size), 'size of bitesize', sizeof(bytesize)
print*, 'size of product', sizeof(my_num_vars*state_ens_handle%num_copies*bytesize), 'size of product', sizeof(my_num_vars*state_ens_handle%num_copies)
print *, 'MPI_ADDRESS_KIND size:', sizeof(0_MPI_ADDRESS_KIND)
print *, 'i4 size:', sizeof(0_i4), 'i8 size:', sizeof(0_i8), 'r8 size:', sizeof(0_r8)
print*, 'HK window size', window_size, 'bytesize', bytesize, 'my_num_vars', my_num_vars, 'copies', state_ens_handle%num_copies
! Expose local memory to RMA operation by other processes in a communicator.
call mpi_win_create(state_ens_handle%copies, window_size, bytesize, MPI_INFO_NULL, get_dart_mpi_comm(), state_win, ierr)
endif
Overflowing: my_num_vars * num_copies * bytesize = 38006229 * 9 * 8 = 2,736,448,488, which is larger than huge(0_i4) = 2,147,483,647.
2,736,448,488 - 4,294,967,296 = -1,558,518,808
MPI_ADDRESS_KIND is i8, but all the operands in the multiplication are i4, so the product wraps before it is widened to MPI_ADDRESS_KIND.
Error Message
[cisl-sapulpa:00000] *** An error occurred in MPI_Win_create
[cisl-sapulpa:00000] *** reported by process [2043936769,0]
[cisl-sapulpa:00000] *** on communicator MPI COMM 3 DUP FROM 0
[cisl-sapulpa:00000] *** MPI_ERR_SIZE: invalid size
[cisl-sapulpa:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[cisl-sapulpa:00000] *** and MPI will try to terminate your MPI job as well)
Which model(s) are you working with?
wrf-chem (unified) but I believe this would affect any state/core combo that has window size > 2,147,483,647
Screenshots
If applicable, add screenshots to help explain your problem.
No thanks.
Version of DART
Which version of DART are you using?
You can find the version using git describe --tags
v11.20.1-50-ge4383af03
This is wrf-unified branch with a bunch of prints, I will make a small reproducer.
Have you modified the DART code?
Yes, to print out the values above,
and to test setting my_num_vars to i8 in win_mod.f90 so that a large window is computed correctly (runs successfully)
Build information
Please describe:
The machine you are running on (e.g. windows laptop, NSF NCAR supercomputer Derecho).
mac M
The compiler you are using (e.g. gnu, intel).
mpif90 --version
GNU Fortran (MacPorts gcc14 14.2.0_3+stdlib_flag) 14.2.0
openmpi @5.0.7
Notes:
Ensemble manager has num_vars as i8 but my_num_vars as i4, so an ensemble type with num_vars/ntasks >
2,147,483,647 cannot describe itself.
For the window, even if you limit my_num_vars to i4, you still need to count in i8:
num_vars/ntasks * copies * bytesize can exceed 2,147,483,647, so window_mod has to cope with i8. Not sure at the
moment what, if anything, is stopping ensemble_type%my_num_vars being i8.
DART/assimilation_code/modules/utilities/ensemble_manager_mod.f90, lines 56 to 59 in cb85f91:
integer(i8) :: num_vars
integer :: num_copies, my_num_copies, my_num_vars
integer, allocatable :: my_copies(:)
integer(i8), allocatable :: my_vars(:)
- Not tested with other compilers.