bug: MPI_Win_create error for state window when using large state with few tasks #1051

@hkershaw-brown

Describe the bug

  1. Run a model with a large state on few tasks (for example, on a laptop):
    model size = 152024914 (wrf-chem)
    ntasks = 4

  2. What was the expected outcome?
    Run ok, or just run out of memory

  3. What actually happened?
    Wacky window size, -1558518808

 Before computing prior observation values TIME: 2026/02/23 13:59:58
 size of window                    8 size of bitesize                    4
 size of product                    4 size of product                    4
 MPI_ADDRESS_KIND size:                    8
 i4 size:                    4 i8 size:                    8 r8 size:                    8
 HK window size          -1558518808 bytesize           8 my_num_vars    38006229 copies           9
[cisl-sapulpa:00000] *** An error occurred in MPI_Win_create

win_mod.f90, printing out these values

   call mpi_type_size(datasize, bytesize, ierr)
   window_size = my_num_vars*state_ens_handle%num_copies*bytesize
print*, 'size of window', sizeof(window_size), 'size of bitesize', sizeof(bytesize)
print*, 'size of product', sizeof(my_num_vars*state_ens_handle%num_copies*bytesize), 'size of product', sizeof(my_num_vars*state_ens_handle%num_copies)
print *, 'MPI_ADDRESS_KIND size:', sizeof(0_MPI_ADDRESS_KIND)
print *, 'i4 size:', sizeof(0_i4), 'i8 size:', sizeof(0_i8), 'r8 size:', sizeof(0_r8)
print*, 'HK window size', window_size, 'bytesize', bytesize, 'my_num_vars', my_num_vars, 'copies', state_ens_handle%num_copies
   ! Expose local memory to RMA operation by other processes in a communicator.
   call mpi_win_create(state_ens_handle%copies, window_size, bytesize, MPI_INFO_NULL, get_dart_mpi_comm(), state_win, ierr)
endif

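Overflow: my_num_vars * num_copies * bytesize = 38006229 * 9 * 8 = 2,736,448,488, which is larger than the i4 maximum of 2,147,483,647.

Wrapped as a 32-bit integer: 2,736,448,488 - 4,294,967,296 = -1,558,518,808

window_size is 8 bytes (MPI_ADDRESS_KIND is i8), but the operands in the multiplication are all i4, so the product is evaluated in i4 (see "size of product 4" in the output above) and overflows before it is assigned to window_size.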
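
A minimal standalone sketch of the arithmetic, using the values from the log above. It is not the DART code: `int64` from `iso_fortran_env` stands in for `MPI_ADDRESS_KIND` (reported as 8 bytes on this build) so the example needs no MPI, and the program name is made up for illustration. Promoting the first operand makes the whole product evaluate in 64 bits. The wrapped value is computed by subtracting 2**32 rather than by actually overflowing an i4, since what an overflowing i4 product does is compiler-dependent.

```fortran
program window_size_overflow
   use, intrinsic :: iso_fortran_env, only : int32, int64
   implicit none

   ! int64 is an assumed stand-in for MPI_ADDRESS_KIND (8 bytes in the log).
   integer(int32) :: my_num_vars, num_copies, bytesize
   integer(int64) :: window_size

   ! Values taken from the 'HK window size' line of the log.
   my_num_vars = 38006229
   num_copies  = 9
   bytesize    = 8

   ! Promote the first operand so the product is evaluated in 64 bits.
   window_size = int(my_num_vars, int64) * num_copies * bytesize
   print *, 'window size (64-bit product):', window_size

   ! The value a 32-bit two's-complement result would wrap to.
   print *, 'wrapped 32-bit value:        ', window_size - 2_int64**32
end program window_size_overflow
```

With these inputs the first value is 2736448488 and the second is -1558518808, the window size seen in the log. In win_mod.f90 the analogous change would presumably be to convert one operand with `int(..., MPI_ADDRESS_KIND)` before multiplying.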

Error Message

[cisl-sapulpa:00000] *** An error occurred in MPI_Win_create
[cisl-sapulpa:00000] *** reported by process [2043936769,0]
[cisl-sapulpa:00000] *** on communicator MPI COMM 3 DUP FROM 0
[cisl-sapulpa:00000] *** MPI_ERR_SIZE: invalid size
[cisl-sapulpa:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[cisl-sapulpa:00000] ***    and MPI will try to terminate your MPI job as well)

Which model(s) are you working with?

wrf-chem (unified) but I believe this would affect any state/core combo that has window size > 2,147,483,647

Screenshots

If applicable, add screenshots to help explain your problem.
No thanks.

Version of DART

Which version of DART are you using?
You can find the version using git describe --tags
v11.20.1-50-ge4383af03
This is wrf-unified branch with a bunch of prints, I will make a small reproducer.

Have you modified the DART code?

Yes: added the print statements above, and tested setting my_num_vars to i8 in win_mod.f90 so that a large window can be created (runs successfully).

Build information

Please describe:

  1. The machine you are running on (e.g. windows laptop, NSF NCAR supercomputer Derecho).
    mac M

  2. The compiler you are using (e.g. gnu, intel).
    mpif90 --version
    GNU Fortran (MacPorts gcc14 14.2.0_3+stdlib_flag) 14.2.0
    openmpi @5.0.7

Notes:

  • Ensemble manager has num_vars as i8, but my_num_vars as i4, so an ensemble type with num_vars/ntasks >
    2,147,483,647 cannot describe itself.

    For the window, even if you limit my_num_vars to i4, you still need to count in i8 whenever
    num_vars/ntasks * copies * bytesize > 2,147,483,647, so window_mod has to cope with i8. Not sure at the
    moment what, if anything, is stopping ensemble_type%my_num_vars from being i8.

integer(i8) :: num_vars
integer :: num_copies, my_num_copies, my_num_vars
integer, allocatable :: my_copies(:)
integer(i8), allocatable :: my_vars(:)

  • Not tested other compilers.
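
The two limits in the notes are separate, and a rough standalone check can tell them apart. The sketch below is hypothetical and not taken from DART: it assumes an even split of num_vars across tasks (rounded up), a 32-bit default integer, and `int64` as a stand-in for i8. It tests first whether the per-task share fits in a default integer, then whether the window size in bytes does.

```fortran
program check_my_num_vars_fits
   use, intrinsic :: iso_fortran_env, only : int64
   implicit none

   integer(int64) :: num_vars, share, window_bytes
   integer        :: ntasks, num_copies, bytesize

   num_vars   = 152024914_int64   ! model size from this report
   ntasks     = 4
   num_copies = 9
   bytesize   = 8

   ! Largest per-task share, rounded up (assumed even distribution).
   share = (num_vars + ntasks - 1) / ntasks

   ! Limit 1: can an i4 my_num_vars hold the per-task share?
   if (share > huge(1)) then
      print *, 'per-task share does not fit in a default integer'
   else
      print *, 'per-task share fits in a default integer:', share
   end if

   ! Limit 2: the window size in bytes, computed entirely in 64 bits.
   window_bytes = share * num_copies * bytesize
   if (window_bytes > huge(1)) then
      print *, 'window size needs 64 bits:', window_bytes
   end if
end program check_my_num_vars_fits
```

For this report the share is 38006229, which fits in i4, while the window size of 2736448488 bytes does not. That is the case where my_num_vars can stay i4 but the window size calculation cannot.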
