The basic operations are trivial, including mpi_file_open and mpi_file_close. Normally, once the file is opened, we write data into it with mpi_file_write or mpi_file_write_at (mpi_file_write_all and mpi_file_write_at_all are the collective versions). If we then want to write data again, we need to reset the file pointer or the explicit offset. For the explicit offset, we calculate it by counting the number of data elements already written. For the file pointer, we could likewise calculate it explicitly by counting, or use the procedures mpi_file_get_position and mpi_file_get_byte_offset combined with an explicit calculation of the starting location of the next write operation.
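Before the full program further below, here is a minimal sketch of the non-collective explicit-offset variant. The file name test.bin, the buffer size of 10, and the 4-byte integer size are assumptions made only for illustration.

program write_at_sketch
  use mpi
  implicit none
  integer ierr, myrank, fh
  integer buf(10)
  integer(kind=MPI_OFFSET_KIND) offset
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  buf = myrank
  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'test.bin', &
                     MPI_MODE_WRONLY + MPI_MODE_CREATE, &
                     MPI_INFO_NULL, fh, ierr)
  offset = myrank * 10 * 4          ! 10 integers of 4 bytes per rank
  call MPI_FILE_WRITE_AT(fh, offset, buf, 10, MPI_INTEGER, &
                         MPI_STATUS_IGNORE, ierr)
  call MPI_FILE_CLOSE(fh, ierr)
  call MPI_FINALIZE(ierr)
end program write_at_sketch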
mpi_file_get_position returns the current position of the individual file pointer in units of etype; the result is offset. We then pass offset to mpi_file_get_byte_offset, which converts a view-relative offset (in units of etype) into a displacement in bytes from the beginning of the file; the result is disp.
Next, we could use mpi_file_set_view to change the file view of the process and then call mpi_file_write_all for the parallel write, though mpi_file_write_all is a blocking function. Note that I have so far not been able to figure out how to use mpi_file_write_at when the previous write was done with mpi_file_set_view and mpi_file_write.
A file view is a triplet of arguments (displacement, etype, filetype) that is passed to MPI_File_set_view.
displacement = number of bytes to be skipped from the start of the file
etype = unit of data access (can be any basic or derived datatype)
filetype = specifies the layout of etypes within the file
The file view sets the starting location of a write by specifying displacement. The displacement is measured from the beginning of the file.
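As an illustration of the triplet, here is a minimal sketch using the names from the example program below; the second argument is the displacement, followed by etype and filetype.

disp = myrank * BYTESIZE                               ! displacement in bytes
call MPI_FILE_SET_VIEW(thefile, disp, MPI_INTEGER, &   ! disp, etype
                       MPI_INTEGER, 'native', &        ! filetype, data representation
                       MPI_INFO_NULL, ierr)
call MPI_FILE_WRITE_ALL(thefile, buf, BUFSIZE, MPI_INTEGER, &
                        MPI_STATUS_IGNORE, ierr)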
program io
  implicit none
continue
  !
  call writeBinFile(2)
  !
contains
  !
  subroutine writeBinFile(n_write)
    use mpi
    integer, intent(in) :: n_write
    integer ierr, i, myrank, nrank, BUFSIZE, BYTESIZE, thefile
    parameter (BUFSIZE=10, BYTESIZE=BUFSIZE*4)
    integer buf(BUFSIZE)
    integer(kind=MPI_OFFSET_KIND) disp
  continue
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nrank, ierr)
    do i = 1, BUFSIZE
      buf(i) = myrank * BUFSIZE + myrank
    end do
    call MPI_FILE_OPEN(MPI_COMM_WORLD, 'mpi_data.bin', &
                       MPI_MODE_WRONLY + MPI_MODE_CREATE, &
                       MPI_INFO_NULL, thefile, ierr)
    ! first write: each rank writes BUFSIZE integers at its own offset
    disp = myrank * BYTESIZE
    ! 1. Individual file pointer
    ! call MPI_FILE_SET_VIEW(thefile, disp, MPI_INTEGER, &
    !                        MPI_INTEGER, 'native', &
    !                        MPI_INFO_NULL, ierr)
    ! call MPI_FILE_WRITE_ALL(thefile, buf, BUFSIZE, MPI_INTEGER, &
    !                         MPI_STATUS_IGNORE, ierr)
    ! 2. Explicit offset
    call MPI_FILE_WRITE_AT_ALL(thefile, disp, buf, BUFSIZE, MPI_INTEGER, &
                               MPI_STATUS_IGNORE, ierr)
    if ( n_write > 1 ) then
      !
      ! second write: skip all the data written in the first round
      disp = nrank * BYTESIZE + myrank * BYTESIZE
      ! 1. Individual file pointer
      ! call MPI_FILE_SET_VIEW(thefile, disp, MPI_INTEGER, &
      !                        MPI_INTEGER, 'native', &
      !                        MPI_INFO_NULL, ierr)
      ! call MPI_FILE_WRITE_ALL(thefile, buf, BUFSIZE, MPI_INTEGER, &
      !                         MPI_STATUS_IGNORE, ierr)
      ! 2. Explicit offset
      call MPI_FILE_WRITE_AT_ALL(thefile, disp, buf, BUFSIZE, MPI_INTEGER, &
                                 MPI_STATUS_IGNORE, ierr)
      !
    end if
    !
    call MPI_FILE_CLOSE(thefile, ierr)
    call MPI_FINALIZE(ierr)
  end subroutine writeBinFile
  !
end program
To write to a file multiple times, we should notice that the starting location of the write must be set again after each write operation. As the example above shows, there are two simple plans for writing twice: 1) two rounds of (MPI_FILE_SET_VIEW + MPI_FILE_WRITE_ALL); 2) two rounds of MPI_FILE_WRITE_AT_ALL. In both plans, the displacement is measured from the beginning of the file for both write operations. A hybrid plan is MPI_FILE_WRITE_AT_ALL followed by (MPI_FILE_SET_VIEW + MPI_FILE_WRITE_ALL), again using absolute displacements measured from the beginning of the file for both writes. The last hybrid plan is (MPI_FILE_SET_VIEW + MPI_FILE_WRITE_ALL) followed by MPI_FILE_WRITE_AT_ALL, but the file view seems to mess up the displacement for MPI_FILE_WRITE_AT_ALL, presumably because after MPI_FILE_SET_VIEW the offset passed to MPI_FILE_WRITE_AT_ALL is interpreted in units of etype relative to the view's displacement rather than as a byte offset from the beginning of the file. I have not been able to implement this hybrid plan correctly.
After one write operation, the explicit setting of displacement could be replaced by the following procedure:
disp = nrank * BYTESIZE + myrank * BYTESIZE
! the following lines give the same disp as the line above
! (offset, like disp, is of kind MPI_OFFSET_KIND)
call MPI_FILE_GET_POSITION(thefile, offset, ierr)
call MPI_FILE_GET_BYTE_OFFSET(thefile, offset, disp, ierr)
disp = disp + (nrank-1)*BYTESIZE
Since each rank's write occupies BYTESIZE bytes and this rank's own write has already advanced its byte offset by BYTESIZE, the file pointer only needs to be moved forward by (nrank-1)*BYTESIZE to skip the data written by the other ranks in the first round.
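As a concrete check (assuming the first write was done through the individual file pointer, i.e. MPI_FILE_SET_VIEW + MPI_FILE_WRITE_ALL, since explicit-offset routines such as MPI_FILE_WRITE_AT_ALL do not advance the individual file pointer): with nrank = 2 and BUFSIZE = 10, BYTESIZE = 40; rank 0 writes its first block at byte 0, so after that write MPI_FILE_GET_BYTE_OFFSET returns 40, and adding (nrank-1)*BYTESIZE = 40 gives 80, which is exactly nrank*BYTESIZE + myrank*BYTESIZE for rank 0.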
The indexed datatype is basically described by a set of block displacements and block lengths: it gathers multiple sections of an array into one element of the newly created datatype, as sketched below.
For reference, check page 14 of the attached MPI DataTypes PDF.
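Here is a minimal sketch of creating such a type with MPI_TYPE_INDEXED; the block lengths and displacements (2 elements at offset 0 and 3 elements at offset 5) are made-up values for illustration.

program indexed_sketch
  use mpi
  implicit none
  integer ierr, newtype
  integer blocklens(2), displs(2)
  integer arr(10)
  call MPI_INIT(ierr)
  blocklens = (/ 2, 3 /)   ! lengths of the two blocks (in elements)
  displs    = (/ 0, 5 /)   ! offsets of the two blocks from the start of arr
  call MPI_TYPE_INDEXED(2, blocklens, displs, MPI_INTEGER, newtype, ierr)
  call MPI_TYPE_COMMIT(newtype, ierr)
  ! one element of newtype now covers arr(1:2) and arr(6:8)
  call MPI_TYPE_FREE(newtype, ierr)
  call MPI_FINALIZE(ierr)
end program indexed_sketch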
On the Taicang cluster (太仓集群), we have to use the following command to run on multiple nodes:
mpirun -n 4 -hosts node2,node3 -perhost 2 -env I_MPI_FABRICS tcp ./test
When the cluster has a virtual NIC in addition to the real NIC, OpenMPI hangs with the default parameters. The reason is that OpenMPI uses TCP to connect through the wrong NIC, i.e. virbr0. The Taicang cluster has the real NIC enp97s0f1 as well as the virtual NIC virbr0. The correct command for running OpenMPI across multiple nodes is
mpirun -x LD_LIBRARY_PATH -n 2 -hostfile machine --mca btl_tcp_if_include enp97s0f1 ~/NFS_Project/NFS/relwithdebinfo_gnu/bin/nfs_opt_g nfs.json
The content of the machine file for running 2 procs is
node3 slots=1
node4 slots=1
Reference: 7. How do I tell Open MPI which IP interfaces / networks to use?
Running the NFR code on the remote nodes requires a complete set-up of the environment variables. NFR compiled with OpenMPI-4 and gfortran-10.2 requires loading libgfortran.so.5; an incomplete set-up of the environment variables results in "error while loading shared libraries: libgfortran.so.5". The correct command is
mpirun -x LD_LIBRARY_PATH -n 2 -hostfile machine --mca btl_tcp_if_include enp97s0f1 ~/NFS_Project/NFS/relwithdebinfo_gnu/bin/nfs_opt_g nfs.json