Re: Reading Binary files with mixed-dimensions

From: David F Porter <PorterDF_at_nyahnyahspammersnyahnyah>
Date: Mon, 11 Feb 2008 13:05:25 -0700

Dave,

Thanks for your extended reply. Fortunately, I have determined that
it is a flat binary file. The limited documentation was misleading
for one of the records (implying specific humidity was only on 28 of
the levels, when in reality it is just unreliable above that). I was
having a terrible time adding up the bytes to get the file size using
your method, until I just tried using the full 40 levels for q.

Anyway, I ended up successfully reading the files into NCL and
converting them to netCDF. Since the horizontal resolution is the same
for every variable, I treated every record as one vertical level and
then appended each level (or record) to create the 3D variables. This
way, the 2D variables were just the last few records.
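
A minimal sketch of that record-per-level approach, written in Python with only the standard library rather than in NCL. The name `stack_levels` and the tiny dimensions are made up for illustration, and it assumes a flat file of big-endian 4-byte floats in which longitude varies fastest within each horizontal record:

```python
import struct


def stack_levels(data, nlat, mlon, nlev, first_record=0):
    """Stack nlev consecutive 2D records from flat big-endian float32 bytes.

    Each record is one horizontal slice of nlat*mlon values. Returns nested
    lists indexed [level][lat][lon].
    """
    npts = nlat * mlon
    rec_bytes = 4 * npts
    levels = []
    for k in range(first_record, first_record + nlev):
        start = k * rec_bytes
        if start + rec_bytes > len(data):
            raise ValueError("record %d lies past the end of the data" % k)
        flat = struct.unpack_from(">%df" % npts, data, start)
        # Split the flat record into nlat rows of mlon values each.
        levels.append([list(flat[j * mlon:(j + 1) * mlon]) for j in range(nlat)])
    return levels


# Tiny made-up example: one 3D variable on 2 levels, then one 2D variable.
nlat, mlon, nlev = 3, 2, 2
values = [float(i) for i in range(nlat * mlon * (nlev + 1))]
data = struct.pack(">%df" % len(values), *values)

var3d = stack_levels(data, nlat, mlon, nlev)                      # records 0..1
var2d = stack_levels(data, nlat, mlon, 1, first_record=nlev)[0]   # last record
```

For a real file, `data` would be the bytes returned by `open(path, "rb").read()`.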

Thanks!
Dave Porter

On Feb 4, 2008, at 7:04 PM, Dave Allured wrote:

> Dave Porter,
>
> Sorry for the delay in responding, I was out for a few days.
>
> You have two distinct problems to solve: Determine the exact format
> of a complex binary file with inadequate documentation, and read
> selected arrays from the file with NCL. You need to solve the file
> format first, which is technically outside the scope of NCL.
> However, since this is a common problem, I would like to add to
> Dennis Shea's comments.
>
> One general method of file sleuthing is to add up the expected byte
> sizes of all of the arrays that are supposed to be in the binary
> file, and compare with the actual file size. The exact difference
> value is an indicator of the file format.
>
>   Difference 0 implies flat binary file.
>   Difference 8 implies Fortran sequential and single large record.
>   Difference 8 * (number of variables) implies Fortran sequential
>     and each variable in a separate record.
>
> Difference 0 is the same as Dennis's length formula for a flat file,
> I am just stating it differently.
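
The size check above can be sketched as follows, in Python rather than NCL. The function names and the grid dimensions are hypothetical placeholders, not the real JRA values, and the 4-byte length markers are the usual case rather than a guarantee:

```python
def expected_sizes(mlon, nlat, klev1, klev2, bytes_per_value=4):
    """Byte sizes of the 10 arrays described in this thread: four 3D arrays
    on klev1 levels, one 3D array on klev2 levels, and five 2D arrays."""
    plane = bytes_per_value * mlon * nlat
    return [plane * klev1] * 4 + [plane * klev2] + [plane] * 5


def classify_by_size(file_size, array_sizes, marker_bytes=4):
    """Guess the file format from (actual size) - (sum of array sizes)."""
    diff = file_size - sum(array_sizes)
    per_record = 2 * marker_bytes  # one length prefix plus one length suffix
    if diff == 0:
        return "flat binary"
    if diff == per_record:
        return "Fortran sequential, one large record"
    if diff == per_record * len(array_sizes):
        return "Fortran sequential, one record per variable"
    return "unknown (difference of %d bytes)" % diff


# Hypothetical dimensions, for illustration only.
sizes = expected_sizes(mlon=320, nlat=160, klev1=40, klev2=40)
flat_size = sum(sizes)
print(classify_by_size(flat_size, sizes))
print(classify_by_size(flat_size + 8 * len(sizes), sizes))
```

In practice `file_size` would come from `os.path.getsize(path)`. An "unknown" result points to the other possibilities: extra or missing variables, packed or double precision data, different dimensions, or header records.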
>
> If the difference is other than one of these three, then there are
> other possibilities. There may be more or fewer variables in the
> file than expected; some of the arrays may be packed or double
> precision; the dimensions may be other than expected; there may be
> header records in the file; or the record structure may be something
> other than the simple possibilities above. You may be able to
> determine some of these other possibilities by manual inspection of
> the binary file.
>
> Fortran sequential binary files contain additional hidden
> information about the length of each data record. This hidden
> information can be used to determine the location and length of all
> of the records in the file, and will help to determine the contents
> of an undocumented file.
>
> The full record structure of a Fortran sequential binary file is:
>
> [Length] [Record 1 data] [Length]
> [Length] [Record 2 data] [Length]
> [Length] [Record 3 data] [Length]
> ...
> [End of file]
>
> Each length indicator is usually a single 4-byte integer that is the
> length in bytes of the data portion of the record. The prefix and
> suffix lengths are the same for each record. You can easily check
> the first record to confirm or disprove that the file is really
> Fortran sequential.
>
> If sequential, then by manual inspection you can *always* jump
> exactly from the start to end of every record in the file, and
> thereby determine the exact file offset and size of every record.
>
> If you then compare the individual detected record sizes to the
> sizes of the expected arrays, then you can start matching up
> variables to record positions in the file. This will allow you to
> separate your 2D and 3D arrays, for example.
>
> This depends on the educated guess that each data variable was
> written as a single Fortran sequential record. This is an
> assumption and is not necessarily how they did it.
>
> The Unix "od" command is helpful to inspect unknown files. You may
> first need to convert a sample big endian file to little endian for
> study, because I don't think od has an option to reverse the byte
> order.
>
> For example, this displays the start of the file as integers,
> allowing you to see the length prefix integer on the first record.
> If this integer matches the expected total size in bytes of one of
> your data arrays, then you are on the right track:
>
> od -Ad -i -N80 file.bin
>
> By adding the -j option, you can offset to get the first record
> suffix and the second record prefix. Suppose len(1) = 200, then
> offset 4 for the first prefix plus 200 for the data:
>
> od -Ad -i -N80 -j204 file.bin
>
> Alternatively, you can use NCL to search for the length integers.
> Similar to Dennis's example, this reads the entire file as a single
> 1D array of integers, then finds the start of each record by
> accumulating the offset. The data size of each record and the
> prefix and suffix integers are printed:
>
>   setfileoption ("bin", "ReadByteOrder", "BigEndian")
>   idat = fbindirread ("jma_file", 0, -1, "integer")
>   dims = dimsizes (idat)
>   offset = 0
>   recnum = 0
>
>   do while (offset .lt. dims(0))
>     prefix = idat(offset)
>     print ("array size " + recnum + " = " + prefix/4)
>     print ("prefix " + recnum + " = " + prefix)
>
>     offset = offset + 2 + prefix/4 ; start of next record
>     suffix = idat(offset-1)
>     print ("suffix " + recnum + " = " + suffix)
>     recnum = recnum + 1
>   end do
>
> This loop will crash with a subscript error if the file is not
> Fortran sequential, or certain other assumptions are wrong.
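
A sketch of the same walk in Python (standard library only, not NCL) that reports an inconsistency instead of running off the end of the array. `walk_records` is an illustrative name; it assumes 4-byte length markers and takes the whole file as a bytes object:

```python
import struct


def walk_records(data, byte_order=">"):
    """Return [(data_offset, data_length), ...] for each Fortran sequential
    record in `data`. Raises ValueError if the length markers do not line
    up, which suggests the file is not Fortran sequential."""
    records = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated length prefix at byte %d" % pos)
        (prefix,) = struct.unpack_from(byte_order + "i", data, pos)
        end = pos + 4 + prefix  # byte offset of the length suffix
        if prefix < 0 or end + 4 > len(data):
            raise ValueError("record at byte %d runs past end of file" % pos)
        (suffix,) = struct.unpack_from(byte_order + "i", data, end)
        if suffix != prefix:
            raise ValueError("prefix %d != suffix %d at byte %d"
                             % (prefix, suffix, pos))
        records.append((pos + 4, prefix))
        pos = end + 4  # start of the next record's prefix
    return records


# Made-up two-record file: 8 data bytes, then 12 data bytes.
sample = (struct.pack(">i", 8) + b"\x00" * 8 + struct.pack(">i", 8)
          + struct.pack(">i", 12) + b"\x01" * 12 + struct.pack(">i", 12))
print(walk_records(sample))
```

The returned lengths can then be compared with the expected array sizes to match variables to record positions. For a real file, pass `open(path, "rb").read()`.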
>
> Once you have better information about the file format, you can move
> on to actually reading selected data arrays from the file. HTH.
>
> Dave Allured
> CU/CIRES Climate Diagnostics Center (CDC)
> http://cires.colorado.edu/science/centers/cdc/
> NOAA/ESRL/PSD, Climate Analysis Branch (CAB)
> http://www.cdc.noaa.gov/
>
> Dennis Shea wrote:
>> Not sure what to say when you say "no documentation for how the
>> binary was written". Given the above lack of information, it is
>> not reasonable to expect any tool to automatically read the files.
>> Files: Flat, or written as a Fortran sequential file, which has
>> extra record information silently embedded.
>>
>> Hopefully, you know the sizes of the dimensions [klev1, klev2,
>> nlat, mlon]. Each float is 4 bytes, so for a flat binary file:
>>
>>   4 (bytes) * [ 4 (variables) * mlon*nlat*klev1
>>               + 1 (variable)  * mlon*nlat*klev2
>>               + 5 (variables) * mlon*nlat ] = total # bytes
>>
>> If this number matches the number of bytes in the file, you know
>> it is a flat file.
>>   setfileoption("bin","ReadByteOrder","BigEndian")
>>   x_1d = fbindirread("jma_file", 0, -1, "float")
>>
>> **If** you knew the order you could do something like the
>> following:
>>
>>   mn = nlat*mlon
>>   nStrt = 0
>>   nLast = mn-1
>>   a_2d = onedtond( x_1d(nStrt:nLast), (/nlat,mlon/) )
>>   nStrt = nLast+1
>>   nLast = nStrt + mn - 1
>>   b_2d = onedtond( x_1d(nStrt:nLast), (/nlat,mlon/) )
>>
>> However, you should also know the meta data if you want
>> to create a good netCDF file.
>>
>> If the bytes do not match, it is a fortran sequential file ....
>> you would have to determine the information from the extra bytes.
>> David F Porter wrote:
>>> Dave,
>>>
>>> Well, I'm trying to read in the Japanese Reanalysis data, for which
>>> there is no documentation for how the binary was written. Most of
>>> the data is in GRIB, but monthly means are in binary. I've tried
>>> both direct and sequential access routines in NCL (attempting to
>>> just read in the first record of the file), and neither worked.
>>> But then again, I'm not positive of the order of the data in the
>>> files either.
>> Not sure what you mean ... "neither worked"
>>>
>>> To clarify, each file is for 1 time period (6-hourlies in GRIB,
>>> monthlies in binary). Each file contains 10 variables. The
>>> problem is, 4 variables are [lon,lat,lev] , one is the same but a
>>> different number of levels, and then 5 are just [lon,lat]. To
>>> make things more interesting, I'm not sure of the order that each
>>> variable was written (it is slow communicating with the JMA).
>> JMA should do a better job.
>>>
>>> I tried converting it to netCDF using IDL, which I've had success
>>> doing with other binary files by just pointing to the starting
>>> byte for each variable. I used the order of the variables in the
>>> documentation (anl_mdl, listed here: http://jra.kishou.go.jp/elements_en.html)
>>> and also the order of variables given by the NCL function
>>> printFileVarSummary() after reading in the corresponding 6-hourly
>>> GRIB files (the same dimensions, just different time and format).
>>>
>>> Dave Porter
>>>
>>>
>>>
>>> On Jan 23, 2008, at 4:27 PM, Dave Allured wrote:
>>>
>>>> Dave,
>>>>
>>>> Can you be more specific as to the type of binary file? Fortran
>>>> "unformatted sequential access"; plain binary such as written by
>>>> Fortran direct access; or something else? I just want to be sure
>>>> I'm on the right wavelength before responding.
>>>>
>>>> Also, assuming one of the first two: So the record length varies
>>>> within each file? Does each file have its own unique layout, or
>>>> is each variable found in exactly the same position in every file?
>>>>
>>>> Dave Allured
>>>> CU/CIRES Climate Diagnostics Center (CDC)
>>>> http://cires.colorado.edu/science/centers/cdc/
>>>> NOAA/ESRL/PSD, Climate Analysis Branch (CAB)
>>>> http://www.cdc.noaa.gov/
>>>>
>>>> David F Porter wrote:
>>>>> Sorry if this has been covered, but I've exhausted the search
>>>>> function with no real results.
>>>>> I am looking to read in some large 4-Byte Float big-endian
>>>>> binary data onto my little-endian machine. The problem I am
>>>>> having is that each file corresponds to ONE time period, but
>>>>> each variable in the file has different dimensions, some 2D and
>>>>> some 3D. Because of the varying sizes, I feel that I cannot
>>>>> simply use the "record number". Also, I only want some of the
>>>>> variables (to save space after loading 300 of these files).
>>>>> I'm not sure if it matters at this point, but the variables are
>>>>> on a gaussian grid.
>>>>> Dave
>>>> _______________________________________________
>>>> ncl-talk mailing list
>>>> ncl-talk_at_ucar.edu
>>>> http://mailman.ucar.edu/mailman/listinfo/ncl-talk

Received on Mon Feb 11 2008 - 13:05:25 MST

This archive was generated by hypermail 2.2.0 : Tue Feb 12 2008 - 14:45:27 MST