Re: Reading Binary files with mixed-dimensions

From: Dave Allured <dave.allured_at_nyahnyahspammersnyahnyah>
Date: Mon, 04 Feb 2008 19:04:33 -0700

Dave Porter,

Sorry for the delay in responding; I was out for a few days.

You have two distinct problems to solve: Determine the exact format
of a complex binary file with inadequate documentation, and read
selected arrays from the file with NCL. You need to solve the file
format first, which is technically outside the scope of NCL.
However, since this is a common problem, I would like to add to
Dennis Shea's comments.

One general method of file sleuthing is to add up the expected byte
sizes of all of the arrays that are supposed to be in the binary
file, and compare with the actual file size. The exact difference
value is an indicator of the file format.

   Difference 0 implies a flat binary file.
   Difference 8 implies Fortran sequential with a single large record
      (one 4-byte length prefix plus one 4-byte suffix).
   Difference 8 * (number of variables) implies Fortran sequential
      with each variable in a separate record.

Difference 0 is the same as Dennis's length formula for a flat file,
I am just stating it differently.
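
As a rough illustration, the size comparison can be sketched in Python
(not NCL). The function name and the default of 4 bytes per value are
my own assumptions for illustration, and the sketch assumes 4-byte
length markers before and after each Fortran record:

```python
from math import prod

def classify_size_difference(file_bytes, shapes, bytes_per_value=4):
    """Compare the actual file size with the summed array sizes.

    shapes is a list of dimension tuples, one per expected variable.
    Assumes 4-byte length markers before and after each Fortran record.
    """
    expected = bytes_per_value * sum(prod(shape) for shape in shapes)
    diff = file_bytes - expected
    if diff == 0:
        return "flat binary"
    if diff == 8:
        return "Fortran sequential, one large record"
    if diff == 8 * len(shapes):
        return "Fortran sequential, one record per variable"
    return "unknown (difference = %d bytes)" % diff
```

In practice file_bytes would come from os.path.getsize() on the data
file, and shapes from whatever dimensions you believe the variables
have. An "unknown" result means one of the guesses needs revisiting.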

If the difference is other than one of these three, then there are
other possibilities. There may be more or fewer variables in the
file than expected; some of the arrays may be packed or double
precision; the dimensions may be other than expected; there may be
header records in the file; the record structure may be something other
than the simple possibilities above. You may be able to determine
some of these other possibilities by manual inspection of the binary
file.

Fortran sequential binary files contain additional hidden
information about the length of each data record. This hidden
information can be used to determine the location and length of all
of the records in the file, and will help to determine the contents
of an undocumented file.

The full record structure of a Fortran sequential binary file is:

   [Length] [Record 1 data] [Length]
   [Length] [Record 2 data] [Length]
   [Length] [Record 3 data] [Length]
   ...
   [End of file]

Each length indicator is usually a single 4-byte integer that gives the
length in bytes of the data portion of the record. Within each record,
the prefix and suffix hold the same value. You can easily check
the first record to confirm or disprove that the file is really
Fortran sequential.
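
A minimal Python sketch (not NCL) of that first-record check follows.
The function name is hypothetical, and the sketch assumes 4-byte
length markers and, by default, big endian byte order:

```python
import struct

def check_first_record(path, byte_order=">"):
    """Return the data length of record 1 if its prefix and suffix agree, else None.

    Assumes 4-byte length markers; byte_order ">" means big endian.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        if len(head) < 4:
            return None
        (prefix,) = struct.unpack(byte_order + "i", head)
        if prefix < 0:
            return None
        f.seek(prefix, 1)          # skip over the data portion
        tail = f.read(4)
        if len(tail) < 4:          # the claimed length runs past end of file
            return None
        (suffix,) = struct.unpack(byte_order + "i", tail)
    return prefix if prefix == suffix else None
```

A return value of None suggests the file is not Fortran sequential, or
that the byte order or marker size assumed here is wrong.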

If sequential, then by manual inspection you can *always* jump
exactly from the start to end of every record in the file, and
thereby determine the exact file offset and size of every record.

If you then compare the individual detected record sizes to the
sizes of the expected arrays, then you can start matching up
variables to record positions in the file. This will allow you to
separate your 2D and 3D arrays, for example.

This depends on the educated guess that each data variable was
written as a single Fortran sequential record. This is an
assumption and is not necessarily how they did it.
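
The matching step could look something like this Python sketch (not
NCL). The function name and the labels in expected_shapes are made up
for illustration, and it relies on the same one-variable-per-record
guess:

```python
from math import prod

def match_records(record_lengths, expected_shapes, bytes_per_value=4):
    """For each record length in bytes, list the names of arrays of that size.

    expected_shapes maps a label to its dimension tuple. An empty list
    means no expected array has that size; several names mean ambiguity.
    """
    sizes = {name: bytes_per_value * prod(shape)
             for name, shape in expected_shapes.items()}
    return [[name for name, size in sizes.items() if size == length]
            for length in record_lengths]
```

Note that arrays with identical dimensions have identical byte sizes,
so this can separate 2D from 3D records but cannot tell apart several
variables of the same shape. For that you still need the write order.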

The Unix "od" command is helpful to inspect unknown files. Older
versions of od have no option to reverse the byte order, so you may
first need to convert a sample big endian file to little endian for
study. Recent GNU versions of od accept --endian=big instead.

For example, this displays the start of the file as integers,
allowing you to see the length prefix integer on the first record.
If this integer matches the expected total size in bytes of one of
your data arrays, then you are on the right track:

   od -Ad -i -N80 file.bin

By adding the -j option, you can offset to get the first record
suffix and the second record prefix. Suppose len(1) = 200, then
offset 4 for the first prefix plus 200 for the data:

   od -Ad -i -N80 -j204 file.bin
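
If od on your system cannot handle the byte order, a short Python
sketch (not NCL) can do a similar dump without converting the file
first. The function name and defaults are my own choices; only the
standard struct module is used:

```python
import struct

def dump_ints(path, offset=0, nbytes=80, byte_order=">"):
    """Return 4-byte integers read from path, starting at a byte offset.

    Roughly mirrors `od -Ad -i -N<nbytes> -j<offset>`, except that the
    byte order is chosen explicitly (">" big endian, "<" little endian).
    """
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read(nbytes)
    count = len(chunk) // 4        # ignore a trailing partial integer
    return list(struct.unpack(byte_order + "%di" % count, chunk[:count * 4]))
```

For the example above, dump_ints("file.bin", offset=204) would start
with the first record suffix followed by the second record prefix,
assuming len(1) = 200.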

Alternatively, you can use NCL to search for the length integers.
Similar to Dennis's example, this reads the entire file as a single
1D array of integers, then finds the start of each record by
accumulating the offset. The data size of each record and the
prefix and suffix integers are printed:

   setfileoption ("bin", "ReadByteOrder", "BigEndian")
   idat = fbindirread ("jma_file", 0, -1, "integer")
   dims = dimsizes (idat)
   offset = 0
   recnum = 0

   do while (offset .lt. dims(0))
     prefix = idat(offset)
     print ("array size " + recnum + " = " + prefix/4)
     print ("prefix " + recnum + " = " + prefix)

     offset = offset + 2 + prefix/4 ; start of next record
     suffix = idat(offset-1)
     print ("suffix " + recnum + " = " + suffix)
     recnum = recnum + 1
   end do

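This loop will most likely fail with a subscript error if the file is
not Fortran sequential, or if other assumptions built into it are
wrong: 4-byte length integers, big endian byte order, and record
lengths that are multiples of 4 bytes.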
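
For comparison, here is a Python sketch (not NCL) of the same walk
that checks each step and reports a problem instead of running off the
end of the array. The function name is hypothetical, and it assumes
4-byte length markers:

```python
import struct

def walk_records(data, byte_order=">"):
    """List (offset, length) for each record in a Fortran sequential byte string.

    offset is the byte position of the record's data, after its prefix.
    Raises ValueError when the markers are inconsistent. Assumes 4-byte markers.
    """
    records = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated prefix at byte %d" % pos)
        (prefix,) = struct.unpack_from(byte_order + "i", data, pos)
        end = pos + 4 + prefix     # byte position of the suffix
        if prefix < 0 or end + 4 > len(data):
            raise ValueError("bad record length %d at byte %d" % (prefix, pos))
        (suffix,) = struct.unpack_from(byte_order + "i", data, end)
        if suffix != prefix:
            raise ValueError("prefix/suffix mismatch at byte %d" % pos)
        records.append((pos + 4, prefix))
        pos = end + 4              # start of the next record's prefix
    return records
```

It would be called on the whole file contents, for example
walk_records(open("jma_file", "rb").read()). Unlike the NCL loop, it
works in bytes, so record lengths need not be multiples of 4.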

Once you have better information about the file format, you can move
on to actually reading selected data arrays from the file. HTH.

Dave Allured
CU/CIRES Climate Diagnostics Center (CDC)
http://cires.colorado.edu/science/centers/cdc/
NOAA/ESRL/PSD, Climate Analysis Branch (CAB)
http://www.cdc.noaa.gov/

Dennis Shea wrote:
> Not sure what to say when you say "no documentation for how the binary
> was written". Given the above lack of information, it is not
> reasonable to
> expect any tool to automatically read the files.
>
> Files: Flat or written as a fortran sequential file which has
> extra record information silently embedded
>
> Hopefully, you know the sizes of the dimensions [klev1, klev2, nlat,
> mlon]. Each float is 4 bytes.
>
> 4 (bytes) * [ 4 (variables) * mlon*nlat*klev1
>             + 1 (variable)  * mlon*nlat*klev2
>             + 5 (variables) * mlon*nlat ] = total # bytes for flat binary
>
> If this number matches the number of bytes in the file, you know it is
> a flat file.
>
> setfileoption("bin","ReadByteOrder","BigEndian")
> x_1d = fbindirread("jma_file", 0, -1, "float")  ; record 0, -1 = whole file as 1D
>
> **If** you knew the order you could do something like the following
>
> mn = nlat*mlon
>
> nStrt = 0
> nLast = mn-1
>
> a_2d = onedtond( x_1d(nStrt:nLast), (/nlat,mlon/) )
> nStrt = nLast+1
> nLast = nStrt + mn - 1
> b_2d = onedtond( x_1d(nStrt:nLast), (/nlat,mlon/) )
>
> However, you should also know the meta data if you want
> to create a good netCDF file.
>
> If the bytes do not match, it may be a Fortran sequential file ....
> you would have to determine the information from the extra bytes.
>
>
>
> David F Porter wrote:
>> Dave,
>>
>> Well, I'm trying to read in the Japanese Reanalysis data, of which
>> there is no documentation for how the binary was written. Most of the
>> data is in GRIB, but monthly means are in binary. I've tried both
>> direct and sequential access routines in NCL (attempting to just read
>> in the first record of the file), and neither worked. But then again,
>> I'm not positive of the order of the data in the files either.
> Not sure what you mean ... "neither worked"
>>
>> To clarify, each file is for 1 time period (6-hourlies in GRIB,
>> monthlies in binary). Each file contains 10 variables. The problem
>> is, 4 variables are [lon,lat,lev] , one is the same but a different
>> number of levels, and then 5 are just [lon,lat]. To make things more
>> interesting, I'm not sure of the order that each variable was written
>> (it is slow communicating with the JMA).
> JMA should do a better job.
>>
>> I tried converting it to netCDF using IDL, which I've had success
>> doing with other binary files by just pointing to the starting byte
>> for each variable. I used the order of the variables in the
>> documentation (anl_mdl listed here
>> http://jra.kishou.go.jp/elements_en.html) and also the order of
>> variables given by the NCL function printFileVarSummary() after
>> reading in the corresponding 6-hourly GRIB files (the same dimensions,
>> just different time and format).
>
>>
>> Dave Porter
>>
>>
>>
>> On Jan 23, 2008, at 4:27 PM, Dave Allured wrote:
>>
>>> Dave,
>>>
>>> Can you be more specific as to the type of binary file? Fortran
>>> "unformatted sequential access"; plain binary such as written by
>>> Fortran direct access; or something else? I just want to be sure I'm
>>> on the right wavelength before responding.
>>>
>>> Also, assuming one of the first two: So the record length varies
>>> within each file? Does each file have its own unique layout, or is
>>> each variable found in exactly the same position in every file?
>>>
>>> Dave Allured
>>> CU/CIRES Climate Diagnostics Center (CDC)
>>> http://cires.colorado.edu/science/centers/cdc/
>>> NOAA/ESRL/PSD, Climate Analysis Branch (CAB)
>>> http://www.cdc.noaa.gov/
>>>
>>> David F Porter wrote:
>>>> Sorry if this has been covered, but I've exhausted the search
>>>> function with no real results.
>>>> I am looking to read in some large 4-Byte Float big-endian binary
>>>> data onto my little-endian machine. The problem I am having is that
>>>> each file corresponds to ONE time period, but each variable in the
>>>> file has different dimensions, some 2D and some 3D. Because of the
>>>> varying sizes, I feel that I cannot simply use the "record number".
>>>> Also, I only want some of the variables (to save space after loading
>>>> 300 of these files).
>>>> I'm not sure if it matters at this point, but the variables are on a
>>>> gaussian grid.
>>>> Dave
>>> _______________________________________________
>>> ncl-talk mailing list
>>> ncl-talk_at_ucar.edu
>>> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
>>
>
>

Received on Mon Feb 04 2008 - 19:04:33 MST
