NCL: Reading binary data

Below is a brief description of the functions available for reading binary data, along with a note about portability issues, two example scripts, and hints on determining the format of a binary file.

Each type of binary file has its own read function, so you must know how your data was written. (See the hints below on how to figure out what type of file you have.)

    Direct Access:            data = fbindirread(path,rec,dim,type)

    Sequential Access:        data = fbinrecread(path,rec,dim,type)

    Cray (C block IO write):  data = cbinread(path,dim,type)

    Cray Sequential:          data = craybinrecread(path,rec,dim,type)

Binary file portability

Binary files created on one machine may not be directly portable to another machine. Just as some human languages are read left-to-right and others right-to-left, computers differ in the order in which they store the bytes of a multi-byte number. The two conventions are called big-endian and little-endian. In a big-endian representation the most significant byte is stored first ("on the left" when the bytes are written out in storage order), while in a little-endian representation the least significant byte is stored first, so the most significant byte is "on the right".

Typical big-endian platforms include AIX, IRIX, Solaris (on SPARC), and the Java Virtual Machine. Typical little-endian machines include PCs running Linux or Windows.
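To make the two byte orders concrete, here is a small sketch in Python (used here only for illustration; it is not part of NCL). It packs the integer 1 both ways and shows what happens when big-endian bytes are read as if they were little-endian.

```python
import struct
import sys

value = 1  # a small integer makes the byte order easy to see

big = struct.pack(">i", value)     # most significant byte first
little = struct.pack("<i", value)  # least significant byte first

print("big-endian bytes:   ", big.hex())
print("little-endian bytes:", little.hex())
print("this machine is", sys.byteorder, "endian")

# Reading the bytes with the wrong byte order yields a different number.
misread = struct.unpack("<i", big)[0]
print("big-endian 1 misread as little-endian:", misread)
```

A wildly wrong value like this one is a common first sign that a file was written with the other byte order.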

NCL allows users to read files created using, say, big-endian machines on little-endian machines and vice versa via the setfileoption procedure. This procedure also allows the data to be written according to a specified byte order.

     setfileoption("bin","ReadByteOrder","LittleEndian")
     v = cbinread("data.littleEndian.bin",-1,"float")

     setfileoption("bin","WriteByteOrder","BigEndian")
     cbinwrite("data.bigEndian.bin",v)
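Conceptually, changing the byte order means reversing the bytes inside each value while leaving the order of the values alone. The Python sketch below illustrates that idea for 4-byte floats. It is not how NCL implements setfileoption, and the function name swap_byte_order is hypothetical.

```python
import array
import struct

def swap_byte_order(raw, typecode="f"):
    """Return a copy of raw with the bytes of every element reversed.

    typecode follows Python's array module ("f" = 4-byte float,
    "d" = 8-byte float, "i" = C int, normally 4 bytes).
    """
    values = array.array(typecode)
    values.frombytes(raw)   # raises ValueError if len(raw) is not a whole number of elements
    values.byteswap()       # reverse the bytes within each element
    return values.tobytes()

# Two floats written little-endian, converted to big-endian.
little = struct.pack("<2f", 1.0, 2.5)
big = swap_byte_order(little)
print(struct.unpack(">2f", big))
```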

Hints on determining the exact format of a binary file

[Thanks to Dave Allured of NOAA for this write-up.]

One general method of file sleuthing is to add up the expected byte sizes of all of the arrays that are supposed to be in the binary file, and compare with the actual file size. The exact difference value is an indicator of the file format.

Difference 0 implies flat binary file.
Difference 8 implies Fortran sequential and single large record (one 4-byte length marker before the data and one after).
Difference 8 * (number of variables) implies Fortran sequential and each variable in a separate record.
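The size comparison can be sketched in Python as follows. The function name classify_by_size and the example array shapes are hypothetical, and the sketch assumes 4-byte record markers unless told otherwise.

```python
def classify_by_size(file_size, array_sizes, marker_bytes=4):
    """Guess the file layout from the actual size and the expected array sizes.

    file_size    -- actual size of the file in bytes
    array_sizes  -- expected size in bytes of each array in the file
    marker_bytes -- assumed size of one Fortran record-length marker
    """
    diff = file_size - sum(array_sizes)
    per_record = 2 * marker_bytes  # one marker before and one after each record
    if diff == 0:
        return "flat binary"
    if diff == per_record:
        return "Fortran sequential, single record"
    if diff == per_record * len(array_sizes):
        return "Fortran sequential, one record per variable"
    return "unknown (difference = %d bytes)" % diff

# Two hypothetical 73 x 144 arrays of 4-byte floats.
sizes = [73 * 144 * 4, 73 * 144 * 4]
print(classify_by_size(84112, sizes))
```

The actual file size can be obtained with os.path.getsize in Python or with "ls -l" on Unix.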

If the difference is not one of these three values, then there are other possibilities. There may be more or fewer variables in the file than expected; some of the arrays may be packed or double precision; the dimensions may be other than expected; there may be header records in the file; or the record structure may be something other than the simple possibilities above. You may be able to narrow down these possibilities by manual inspection of the binary file.

Fortran sequential binary files contain additional hidden information about the length of each data record. This hidden information can be used to determine the location and length of all of the records in the file, and will help to determine the contents of an undocumented file.

The full record structure of a Fortran sequential binary file is:

[Length] [Record 1 data] [Length]
[Length] [Record 2 data] [Length]
[Length] [Record 3 data] [Length]
...
[End of file]
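As an illustration of this layout, the following Python sketch builds two records in memory the way a Fortran sequential write would lay them out, assuming 4-byte big-endian length markers. The helper name pack_record is hypothetical.

```python
import struct

def pack_record(payload, byteorder=">"):
    """Wrap payload in a 4-byte length prefix and an identical suffix."""
    marker = struct.pack(byteorder + "i", len(payload))
    return marker + payload + marker

# Record 1 holds three floats (12 bytes), record 2 holds two floats (8 bytes).
rec1 = pack_record(struct.pack(">3f", 1.0, 2.0, 3.0))
rec2 = pack_record(struct.pack(">2f", 4.0, 5.0))
stream = rec1 + rec2

print(len(rec1), len(rec2), len(stream))
print(stream[:4].hex(), stream[16:20].hex())  # prefix and suffix of record 1
```

Each record adds 8 bytes of overhead to its data, which is where the size differences listed earlier come from.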

Each length indicator is usually a single 4-byte integer** that is the length in bytes of the data portion of the record. The prefix and suffix lengths are the same for each record. You can easily check the first record to confirm or disprove that the file is really Fortran sequential.

If sequential, then by manual inspection you can *always* jump exactly from the start to end of every record in the file, and thereby determine the exact file offset and size of every record.
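The jump from record to record can be sketched in Python as below. This is a minimal illustration rather than an NCL feature: the function name scan_records is hypothetical, and it stops with an error as soon as a prefix and suffix disagree, which suggests the file is not Fortran sequential (or that the byte order or marker size is wrong).

```python
import struct

def scan_records(data, byteorder=">", marker_bytes=4):
    """Return [(data_offset, data_length), ...] for every record in data."""
    if marker_bytes not in (4, 8):
        raise ValueError("marker_bytes must be 4 or 8")
    fmt = byteorder + ("i" if marker_bytes == 4 else "q")
    records = []
    pos = 0
    while pos < len(data):
        if pos + marker_bytes > len(data):
            raise ValueError("truncated length prefix at offset %d" % pos)
        (length,) = struct.unpack_from(fmt, data, pos)
        end = pos + marker_bytes + length          # offset of the suffix marker
        if length < 0 or end + marker_bytes > len(data):
            raise ValueError("record at offset %d runs past end of file" % pos)
        (suffix,) = struct.unpack_from(fmt, data, end)
        if suffix != length:
            raise ValueError("prefix/suffix mismatch at offset %d" % pos)
        records.append((pos + marker_bytes, length))
        pos = end + marker_bytes                   # start of the next record
    return records

# Two records of 8 and 4 data bytes with big-endian 4-byte markers.
sample = (struct.pack(">i", 8) + bytes(8) + struct.pack(">i", 8)
          + struct.pack(">i", 4) + bytes(4) + struct.pack(">i", 4))
print(scan_records(sample))
```

For a real file, data would be the result of open(path, "rb").read().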

If you then compare the individual detected record sizes to the sizes of the expected arrays, then you can start matching up variables to record positions in the file. This will allow you to separate your 2D and 3D arrays, for example.
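Matching detected record sizes to expected arrays is a simple lookup. The Python sketch below assumes you already have the list of record lengths in bytes; the variable names and array shapes are hypothetical examples.

```python
def match_records(record_lengths, expected_sizes):
    """For each record length, list the expected variables of the same size.

    record_lengths -- data length in bytes of each record, in file order
    expected_sizes -- dict mapping variable name to its expected size in bytes
    """
    return [
        [name for name, size in expected_sizes.items() if size == length]
        for length in record_lengths
    ]

expected = {
    "sst_2d": 73 * 144 * 4,        # hypothetical 2D field of 4-byte floats
    "temp_3d": 17 * 73 * 144 * 4,  # hypothetical 3D field with 17 levels
}
lengths = [42048, 714816, 42048, 999]
for recnum, names in enumerate(match_records(lengths, expected)):
    print("record", recnum, "->", names or "no match")
```

A record with no match may be a header, or a variable with different dimensions or precision than expected.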

This depends on the educated guess that each data variable was written as a single Fortran sequential record. This is an assumption, and it is not necessarily how the file was actually written.

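The Unix "od" command is helpful for inspecting unknown files. Many versions of od display multi-byte values only in the machine's native byte order, so you may first need to convert a sample big-endian file to little-endian for study. Recent versions of GNU od accept an --endian=big or --endian=little option instead.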

For example, this displays the start of the file as integers, allowing you to see the length prefix integer on the first record. If this integer matches the expected total size in bytes of one of your data arrays, then you are on the right track:

  od -Ad -i -N80 file.bin

By adding the -j option, you can offset to get the first record suffix and the second record prefix. Suppose len(1) = 200, then offset 4 for the first prefix plus 200 for the data:

  od -Ad -i -N80 -j204 file.bin
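If od on your system cannot display the non-native byte order, the same peek can be done with a few lines of Python. This is only a sketch; the function name peek_ints is hypothetical, and it assumes 4-byte integers.

```python
import struct

def peek_ints(data, offset=0, count=4, byteorder=">"):
    """Return up to count 4-byte integers starting at byte offset."""
    available = (len(data) - offset) // 4
    count = min(count, available)
    if count <= 0:
        return []
    return list(struct.unpack_from("%s%di" % (byteorder, count), data, offset))

# A 200-byte record followed by the prefix of a 120-byte record.
# For a real file, use: data = open("file.bin", "rb").read()
data = struct.pack(">i", 200) + bytes(200) + struct.pack(">2i", 200, 120)

print(peek_ints(data, 0, 1))    # prefix of record 1
print(peek_ints(data, 204, 2))  # suffix of record 1, prefix of record 2
```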

Alternatively, you can use NCL to search for the length integers. The example below reads the entire file as a single 1D array of integers, then finds the start of each record by accumulating the offset.

The data size of each record and the prefix and suffix integers are printed:

  ; Read the whole file as one 1D array of 4-byte integers.
  setfileoption ("bin", "ReadByteOrder", "BigEndian")
  idat = fbindirread ("jma_file", 0, -1, "integer")
  dims = dimsizes (idat)
  offset = 0                        ; index of the current length prefix
  recnum = 0

  do while (offset .lt. dims(0))
    prefix = idat(offset)           ; record length in bytes
    print ("array size " + recnum + " = " + prefix/4)   ; in 4-byte words
    print ("prefix " + recnum + " = " + prefix)

    offset = offset + 2 + prefix/4    ; start of next record
    suffix = idat(offset-1)         ; should equal prefix
    print ("suffix " + recnum + " = " + suffix)
    recnum = recnum + 1
  end do

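This loop will crash with a subscript error if the file is not Fortran sequential, or if other assumptions are wrong, for example that the length indicators are 4-byte integers and that every record length is a multiple of 4 bytes.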

Once you have better information about the file format, you can move on to actually reading selected data arrays from the file.

** Note: on some systems, the record indicator size might be 8 bytes. This can happen if the compiler uses the size of "off_t" to determine the record indicator length. To force a 4-byte record indicator with the g77 or gfortran compilers, you can recompile your code with the option "-frecord-marker=4".
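One way to guess whether a file uses 4-byte or 8-byte record indicators is to test the first record under both assumptions and see which one gives a matching prefix and suffix. The Python sketch below is a heuristic only, and the function name detect_marker_bytes is hypothetical; unusual data could in principle match by coincidence.

```python
import struct

def detect_marker_bytes(data, byteorder=">"):
    """Return 4 or 8 if the first record is consistent with that marker size, else None."""
    for width, code in ((4, "i"), (8, "q")):
        if len(data) < 2 * width:
            continue
        (length,) = struct.unpack_from(byteorder + code, data, 0)
        end = width + length                     # offset where the suffix should be
        if length < 0 or end + width > len(data):
            continue
        (suffix,) = struct.unpack_from(byteorder + code, data, end)
        if suffix == length:
            return width
    return None

payload = b"\x01" * 8
four = struct.pack(">i", 8) + payload + struct.pack(">i", 8)
eight = struct.pack(">q", 8) + payload + struct.pack(">q", 8)
print(detect_marker_bytes(four), detect_marker_bytes(eight))
```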

read_bin_1.ncl: This example shows how to read several records from an unformatted Fortran binary file. The assumption here is that the binary file has the same "endianness" as the machine on which you are running the script. Otherwise, you need to use setfileoption as described above to change the endianness.

Note that the metadata must be assigned explicitly, since binary data does not contain metadata.

read_bin_2.ncl: This script shows how to read data from record 0 of several Fortran binary files into a single array, and then write it to a NetCDF file.

As with the above example, you need to create and assign the metadata yourself before writing the variable to a NetCDF file.