Below is a brief description of the functions available for reading
binary data, along with a note about portability issues, two example
scripts, and hints on determining the format of a binary file.
Each type of binary data has its own read function. You must know how
your data was written. (See note below for
hints on how to figure out what type of file you have.)
Direct Access: data = fbindirread(path,rec,dim,type)
Sequential Access: data = fbinrecread(path,rec,dim,type)
Cray (C block IO write): data = cbinread(path,dim,type)
Cray Sequential: data = craybinrecread(path,rec,dim,type)
Binary file portability
Binary files created on one machine may not be directly
portable to another machine. Just as some human languages
are read left-to-right and others right-to-left, computers
differ in the order in which they store the bytes of a
multi-byte number. The two orderings are called big-endian and
little-endian. In a big-endian representation the most
significant byte is stored first (at the lowest address); in a
little-endian representation the least significant byte is
stored first.
Typical big-endian platforms include AIX, IRIX, Solaris (on SPARC),
and the Java Virtual Machine. Typical little-endian machines include
PCs running Linux or Windows.
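To make the two byte orders concrete, here is a minimal Python sketch (Python is used only for illustration; it is not part of NCL). It packs the 32-bit integer 1 both ways and prints the resulting bytes in hexadecimal, along with the byte order of the machine running the script. The variable names are illustrative.

```python
import struct
import sys

value = 1  # a 32-bit integer whose bytes are easy to tell apart

big = struct.pack(">i", value)     # most significant byte first
little = struct.pack("<i", value)  # least significant byte first

print("big-endian bytes:   ", big.hex())
print("little-endian bytes:", little.hex())
print("this machine is:    ", sys.byteorder)
```

The same four bytes appear in both cases, only in reverse order, which is why a file written on one kind of machine reads as nonsense on the other unless the byte order is declared.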
NCL allows users to read files created on, say, big-endian
machines on little-endian machines, and vice versa, via the
setfileoption procedure. This procedure also allows data to be
written in a specified byte order.
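; read a little-endian file; -1 reads all values into a 1D array
setfileoption("bin","ReadByteOrder","LittleEndian")
v = cbinread("data.littleEndian.bin",-1,"float")
; write the same values back out in big-endian order
setfileoption("bin","WriteByteOrder","BigEndian")
cbinwrite("data.bigEndian.bin",v)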
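A byte-order mismatch usually shows up as values that are wildly wrong rather than slightly wrong. The following Python sketch (an illustration outside NCL, with made-up variable names) decodes the same four bytes with the matching and the opposite byte order so the symptom is easy to recognise.

```python
import struct

# Four bytes holding the float 1.0 as a big-endian machine would write it.
raw = struct.pack(">f", 1.0)

right = struct.unpack(">f", raw)[0]  # decoded with the matching byte order
wrong = struct.unpack("<f", raw)[0]  # decoded with the opposite byte order

print("matching order:", right)
print("opposite order:", wrong)
```

If data read into NCL looks like the second value (absurdly tiny or huge numbers where ordinary ones are expected), switching "ReadByteOrder" is a reasonable first thing to try.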
Hints on determining the exact format of a binary file
[Thanks to Dave Allured of NOAA for this write-up.]
One general method of file sleuthing is to add up the expected byte
sizes of all of the arrays that are supposed to be in the binary file,
and compare with the actual file size. The exact difference value is
an indicator of the file format.
- Difference 0 implies flat binary file.
- Difference 8 implies Fortran sequential with a single large record
(one 4-byte length prefix plus one 4-byte suffix).
- Difference 8 * (number of variables) implies Fortran sequential
with each variable in a separate record.
If the difference is not one of these three values, there are
other possibilities: there may be more or fewer variables in the file
than expected; some of the arrays may be packed or double precision;
the dimensions may be other than expected; there may be header records
in the file; or the record structure may be something other than the
simple possibilities above. You may be able to determine some of these
other possibilities by manual inspection of the binary file.
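The size comparison described above can be sketched in a few lines of Python. The function name classify_by_size and the example array sizes are hypothetical, and the sketch assumes 4-byte record length indicators.

```python
def classify_by_size(file_size, array_sizes):
    """Guess the file layout from its size and the expected array sizes (bytes)."""
    expected = sum(array_sizes)
    diff = file_size - expected
    if diff == 0:
        return "flat binary"
    if diff == 8:
        return "Fortran sequential, single record"
    if diff == 8 * len(array_sizes):
        return "Fortran sequential, one record per variable"
    return "unknown (difference = %d bytes)" % diff


# Example: two 73 x 144 arrays of 4-byte floats are expected in the file.
sizes = [73 * 144 * 4, 73 * 144 * 4]
# In practice file_size would come from os.path.getsize("file.bin").
print(classify_by_size(sum(sizes) + 16, sizes))
```

An "unknown" result is where the other possibilities listed above (extra variables, header records, different precision) need to be considered.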
Fortran sequential binary files contain additional hidden information
about the length of each data record. This hidden information can be
used to determine the location and length of all of the records in the
file, and will help to determine the contents of an undocumented file.
The full record structure of a Fortran sequential binary file is:
[Length] [Record 1 data] [Length]
[Length] [Record 2 data] [Length]
[Length] [Record 3 data] [Length]
...
[End of file]
Each length indicator is usually a single 4-byte integer** that gives
the length in bytes of the data portion of the record. The prefix and
suffix of a given record hold the same value. You can easily check the
first record to confirm or disprove that the file is really Fortran
sequential.
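The first-record check can be sketched in Python as follows. The function name guess_marker_width is hypothetical. It tries a 4-byte and then an 8-byte length indicator (see the note at the end of this page) and reports which one, if either, gives a matching prefix and suffix. A match is strong evidence, not proof, that the file is Fortran sequential.

```python
import os
import struct


def guess_marker_width(path, byteorder=">"):
    """Return 4 or 8 if the first record's prefix and suffix agree, else None.

    byteorder is ">" for big-endian files and "<" for little-endian files.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        for width, code in ((4, "i"), (8, "q")):
            if size < 2 * width:
                continue
            f.seek(0)
            (length,) = struct.unpack(byteorder + code, f.read(width))
            # The record (prefix + data + suffix) must fit inside the file.
            if length < 0 or width + length + width > size:
                continue
            f.seek(width + length)
            (suffix,) = struct.unpack(byteorder + code, f.read(width))
            if suffix == length:
                return width
    return None
```

A result of None means the file is probably not Fortran sequential, or the byte order passed in is wrong.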
If sequential, then by manual inspection you can *always* jump exactly
from the start to end of every record in the file, and thereby
determine the exact file offset and size of every record.
If you then compare the individual detected record sizes to the sizes
of the expected arrays, then you can start matching up variables to
record positions in the file. This will allow you to separate your 2D
and 3D arrays, for example.
This depends on the educated guess that each data variable was written
as a single Fortran sequential record. That is only an assumption; the
file may not have been written that way.
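The Unix "od" command is helpful for inspecting unknown files. By
default od interprets multi-byte values in the byte order of the
machine it runs on. Newer versions of GNU od accept an --endian option
(--endian=big or --endian=little); with other versions you may first
need to convert a sample big-endian file to little-endian for study.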
For example, this displays the start of the file as integers, allowing
you to see the length prefix integer on the first record. If this
integer matches the expected total size in bytes of one of your data
arrays, then you are on the right track:
od -Ad -i -N80 file.bin
By adding the -j option, you can skip ahead to see the first record's
suffix and the second record's prefix. Suppose len(1) = 200; then the
offset is 4 for the first prefix plus 200 for the data:
od -Ad -i -N80 -j204 file.bin
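Where od cannot be told the file's byte order, a small Python sketch can play the same role as the commands above. The function name dump_ints is hypothetical. Its count and offset arguments correspond roughly to -N (divided by 4) and -j, and byteorder selects how the integers are decoded.

```python
import struct


def dump_ints(path, count=20, offset=0, byteorder=">"):
    """Return up to `count` 4-byte integers starting `offset` bytes into the file."""
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(4 * count)
    usable = len(raw) - len(raw) % 4  # ignore a trailing partial integer
    n = usable // 4
    return list(struct.unpack(byteorder + "%di" % n, raw[:usable]))


# Usage, assuming a big-endian file named file.bin:
#   for i, v in enumerate(dump_ints("file.bin", offset=204)):
#       print(204 + 4 * i, v)
```

Printing the byte offset next to each value, as in the usage comment, gives output comparable to the decimal addresses that od -Ad shows.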
Alternatively, you can use NCL to search for the length integers. The
example below reads the entire file as a single 1D array of integers,
then finds the start of each record by accumulating the offset.
The data size of each record and the prefix and suffix integers are printed:
setfileoption ("bin", "ReadByteOrder", "BigEndian")
; read the whole file as one 1D array of 4-byte integers
idat = fbindirread ("jma_file", 0, -1, "integer")
dims = dimsizes (idat)
offset = 0    ; index (in integers) of the current record prefix
recnum = 0
do while (offset .lt. dims(0))
  prefix = idat(offset)    ; record length in bytes
  ; array size assumes 4-byte data elements
  print ("array size " + recnum + " = " + prefix/4)
  print ("prefix " + recnum + " = " + prefix)
  offset = offset + 2 + prefix/4 ; start of next record
  suffix = idat(offset-1)
  print ("suffix " + recnum + " = " + suffix)
  recnum = recnum + 1
end do
This loop will crash with a subscript error if the file is not Fortran
sequential, or certain other assumptions are wrong.
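The same walk can be written with explicit checks, so that a file that is not Fortran sequential produces a message saying where the record structure broke down. This is a minimal Python sketch, not an NCL function. The name list_records is hypothetical, 4-byte length indicators are assumed, and the whole file is read into memory.

```python
import struct


def list_records(path, byteorder=">"):
    """Return a list of (data_offset, data_length) pairs, one per record.

    Raises ValueError if the file does not follow the
    [Length] [data] [Length] layout with 4-byte length indicators.
    """
    with open(path, "rb") as f:
        data = f.read()
    records = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated prefix at offset %d" % pos)
        (length,) = struct.unpack_from(byteorder + "i", data, pos)
        end = pos + 4 + length  # offset of this record's suffix
        if length < 0 or end + 4 > len(data):
            raise ValueError("implausible record length at offset %d" % pos)
        (suffix,) = struct.unpack_from(byteorder + "i", data, end)
        if suffix != length:
            raise ValueError("prefix/suffix mismatch at offset %d" % pos)
        records.append((pos + 4, length))
        pos = end + 4  # start of the next record's prefix
    return records
```

The returned offsets and lengths are in bytes, so they can be compared directly with the expected byte sizes of the arrays.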
Once you have better information about the file format, you can move
on to actually reading selected data arrays from the file.
** Note: on some systems,
the record indicator size might be 8 bytes. This can happen if the
compiler uses the size of "off_t" to determine the record indicator
length. To force a 4-byte record indicator with the g77 or gfortran
compilers, you can recompile your code with the option
"-frecord-marker=4".