Re: Reading Binary files with mixed-dimensions

From: Dave Allured <dave.allured_at_nyahnyahspammersnyahnyah>
Date: Mon, 11 Feb 2008 17:41:01 -0700

Dave,

Glad to help. Were you able to distinguish the individual variables
with identical dimensions? I figured this could be done by educated
guessing based on the listing order in the incomplete documentation,
numeric ranges of the arrays, and perhaps graphical display of 2D
slices of each array.
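
The numeric-range check can be sketched outside NCL as follows. This is only a rough illustration in Python, assuming a flat file of 4-byte big-endian floats in which every record is one 2D slice; the name record_ranges and its parameters are hypothetical.

```python
import struct

def record_ranges(data, values_per_record, byteorder=">"):
    """Return (min, max) for each consecutive record of 4-byte floats.

    `data` is the raw file content as bytes. Any trailing partial record
    is ignored.
    """
    record_bytes = 4 * values_per_record
    fmt = "%s%df" % (byteorder, values_per_record)
    ranges = []
    for start in range(0, len(data) - record_bytes + 1, record_bytes):
        values = struct.unpack_from(fmt, data, start)
        ranges.append((min(values), max(values)))
    return ranges
```

A sudden change in range from one record to the next (for example, from values near 250 to values near 0.01) would hint at a boundary between two variables, although that is a heuristic and not proof.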

--Dave

David F Porter wrote:
> Dave,
>
> Thanks for your extended reply. Fortunately, I have determined that it
> is a flat binary file. The limited documentation was misleading for one
> of the records (implying specific humidity was only on 28 of the levels,
> when in reality it is just unreliable above that). I was having a
> terrible time adding up the bytes to get the file size using your
> method, until I just tried using the full 40 levels for q.
>
> Anyway, I ended up successfully reading in the files in NCL and
> converting to netCDF. Since the horizontal resolution is the same for
> every variable, I treated every record as one vertical level and then
> appended each level (or record) to create the 3D variables. This way,
> the 2D variables were just the last few records.
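
The level-per-record approach described above might look like this in Python. It is a sketch under assumptions, not the NCL script that was actually used: the file is taken to be flat 4-byte big-endian floats, each record is one nlat x mlon slice with mlon varying fastest, and read_level_records is a hypothetical name.

```python
import struct

def read_level_records(path, nlat, mlon, first_record, n_records, byteorder=">"):
    """Read consecutive 2D records from a flat file of 4-byte floats.

    Returns a list of n_records levels; each level is a list of nlat rows,
    each row a list of mlon values.
    """
    values_per_record = nlat * mlon
    record_bytes = 4 * values_per_record
    fmt = "%s%df" % (byteorder, values_per_record)
    levels = []
    with open(path, "rb") as f:
        f.seek(first_record * record_bytes)  # skip the records before this variable
        for _ in range(n_records):
            chunk = f.read(record_bytes)
            if len(chunk) != record_bytes:
                raise ValueError("file ended before the requested records")
            flat = struct.unpack(fmt, chunk)
            levels.append([list(flat[r * mlon:(r + 1) * mlon]) for r in range(nlat)])
    return levels
```

A 3D variable is then n_records equal to its number of levels, and a 2D variable is simply n_records = 1 at the appropriate first_record.
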
>
> Thanks!
> Dave Porter
>
>
> On Feb 4, 2008, at 7:04 PM, Dave Allured wrote:
>
>> Dave Porter,
>>
>> Sorry for the delay in responding, I was out for a few days.
>>
>> You have two distinct problems to solve: Determine the exact format
>> of a complex binary file with inadequate documentation, and read
>> selected arrays from the file with NCL. You need to solve the file
>> format first, which is technically outside the scope of NCL. However,
>> since this is a common problem, I would like to add to Dennis
>> Shea's comments.
>>
>> One general method of file sleuthing is to add up the expected byte
>> sizes of all of the arrays that are supposed to be in the binary file,
>> and compare with the actual file size. The exact difference value is
>> an indicator of the file format.
>>
>>   Difference 0 implies flat binary file.
>>   Difference 8 implies Fortran sequential and single large record.
>>   Difference 8 * (number of variables) implies Fortran sequential
>>     and each variable in a separate record.
>>
>> Difference 0 is the same as Dennis's length formula for a flat file; I
>> am just stating it differently.
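
The size comparison above can be written down as a small Python sketch. It assumes 4-byte values and 4-byte record markers (hence 8 extra bytes per record); the function names are hypothetical, and the array shapes must be supplied by the user.

```python
def expected_flat_bytes(shapes, bytes_per_value=4):
    """Sum the byte sizes of arrays with the given dimension tuples."""
    total = 0
    for shape in shapes:
        count = 1
        for n in shape:
            count *= n
        total += count * bytes_per_value
    return total

def classify_size_difference(expected_bytes, actual_bytes, n_variables, marker_bytes=4):
    """Interpret (actual - expected) using the three simple cases listed above."""
    diff = actual_bytes - expected_bytes
    per_record = 2 * marker_bytes  # one length prefix plus one length suffix
    if diff == 0:
        return "flat binary"
    if diff == per_record:
        return "Fortran sequential, single large record"
    if diff == per_record * n_variables:
        return "Fortran sequential, one record per variable"
    return "unknown (difference of %d bytes)" % diff
```

Here actual_bytes would typically come from os.path.getsize on the file in question. An "unknown" result points to the other possibilities discussed in this message, such as a wrong level count for one variable.
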
>>
>> If the difference is other than one of these three, then there are
>> other possibilities. There may be more or fewer variables in the file
>> than expected; some of the arrays may be packed or double precision;
>> the dimensions may be other than expected; there may be header records
>> in the file; or the record structure may be something other than the
>> simple possibilities above. You may be able to determine some of these
>> other possibilities by manual inspection of the binary file.
>>
>> Fortran sequential binary files contain additional hidden information
>> about the length of each data record. This hidden information can be
>> used to determine the location and length of all of the records in the
>> file, and will help to determine the contents of an undocumented file.
>>
>> The full record structure of a Fortran sequential binary file is:
>>
>> [Length] [Record 1 data] [Length]
>> [Length] [Record 2 data] [Length]
>> [Length] [Record 3 data] [Length]
>> ...
>> [End of file]
>>
>> Each length indicator is usually a single 4-byte integer that is the
>> length in bytes of the data portion of the record. Within each record,
>> the prefix and suffix hold the same value. You can easily check the
>> first record to confirm or disprove that the file is really Fortran
>> sequential.
>>
>> If sequential, then by manual inspection you can *always* jump exactly
>> from the start to end of every record in the file, and thereby
>> determine the exact file offset and size of every record.
>>
>> If you then compare the individual detected record sizes to the sizes
>> of the expected arrays, then you can start matching up variables to
>> record positions in the file. This will allow you to separate your 2D
>> and 3D arrays, for example.
>>
>> This depends on the educated guess that each data variable was written
>> as a single Fortran sequential record. This is an assumption and is
>> not necessarily how they did it.
>>
>> The Unix "od" command is helpful to inspect unknown files. You may
>> first need to convert a sample big endian file to little endian for
>> study, because I don't think od has an option to reverse the byte order.
>>
>> For example, this displays the start of the file as integers, allowing
>> you to see the length prefix integer on the first record. If this
>> integer matches the expected total size in bytes of one of your data
>> arrays, then you are on the right track:
>>
>> od -Ad -i -N80 file.bin
>>
>> By adding the -j option, you can offset to get the first record suffix
>> and the second record prefix. Suppose len(1) = 200, then offset 4 for
>> the first prefix plus 200 for the data:
>>
>> od -Ad -i -N80 -j204 file.bin
>>
>> Alternatively, you can use NCL to search for the length integers.
>> Similar to Dennis's example, this reads the entire file as a single 1D
>> array of integers, then finds the start of each record by accumulating
>> the offset. The data size of each record and the prefix and suffix
>> integers are printed:
>>
>>   setfileoption ("bin", "ReadByteOrder", "BigEndian")
>>   idat = fbindirread ("jma_file", 0, -1, "integer")
>>   dims = dimsizes (idat)
>>   offset = 0
>>   recnum = 0
>>
>>   do while (offset .lt. dims(0))
>>     prefix = idat(offset)
>>     print ("array size " + recnum + " = " + prefix/4)
>>     print ("prefix " + recnum + " = " + prefix)
>>
>>     offset = offset + 2 + prefix/4 ; start of next record
>>     suffix = idat(offset-1)
>>     print ("suffix " + recnum + " = " + suffix)
>>     recnum = recnum + 1
>>   end do
>>
>> This loop will crash with a subscript error if the file is not Fortran
>> sequential, or certain other assumptions are wrong.
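
For comparison, the same record walk can be sketched in Python with explicit checks, so that a file that is not Fortran sequential produces a clear error where the loop above would hit a subscript failure. This assumes 4-byte big-endian length markers; scan_fortran_records is a hypothetical name, and `data` is the whole file read as bytes.

```python
import struct

def scan_fortran_records(data, byteorder=">"):
    """Return (data_offset, data_length) in bytes for each record in `data`.

    Raises ValueError when the bytes do not follow the
    [length][data][length] layout.
    """
    marker = struct.Struct(byteorder + "i")  # 4-byte signed length marker
    records = []
    offset = 0
    while offset < len(data):
        if offset + marker.size > len(data):
            raise ValueError("truncated length prefix at byte %d" % offset)
        (prefix,) = marker.unpack_from(data, offset)
        start = offset + marker.size
        end = start + prefix
        if prefix < 0 or end + marker.size > len(data):
            raise ValueError("record at byte %d overruns the file" % offset)
        (suffix,) = marker.unpack_from(data, end)
        if suffix != prefix:
            raise ValueError("prefix %d != suffix %d at byte %d" % (prefix, suffix, offset))
        records.append((start, prefix))
        offset = end + marker.size  # start of the next record's prefix
    return records
```

Dividing each returned length by 4 gives the number of float values in that record, which can then be matched against the expected array sizes.
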
>>
>> Once you have better information about the file format, you can move
>> on to actually reading selected data arrays from the file. HTH.
>>
>> Dave Allured
>> CU/CIRES Climate Diagnostics Center (CDC)
>> http://cires.colorado.edu/science/centers/cdc/
>> NOAA/ESRL/PSD, Climate Analysis Branch (CAB)
>> http://www.cdc.noaa.gov/
>>
>> Dennis Shea wrote:
>>> Not sure what to say when you say "no documentation for how the binary
>>> was written". Given the above lack of information, it is not
>>> reasonable to expect any tool to automatically read the files.
>>>
>>> Files: Flat, or written as a Fortran sequential file which has
>>> extra record information silently embedded.
>>>
>>> Hopefully, you know the sizes of the dimensions [klev1, klev2,
>>> nlat, mlon]. Each float is 4 bytes:
>>>
>>>   4 (bytes) * [ 4 (variables) * mlon*nlat*klev1
>>>               + 1 (variable)  * mlon*nlat*klev2
>>>               + 5 (variables) * mlon*nlat ] = total # bytes for flat binary
>>>
>>> If this number matches the number of bytes in the file, you know it is
>>> a flat file.
>>>
>>>   setfileoption("bin","ReadByteOrder","BigEndian")
>>>   x_1d = fbindirread("jma_file", 0, -1, "float")
>>>
>>> **If** you knew the order you could do something like the following:
>>>
>>>   mn    = nlat*mlon
>>>   nStrt = 0
>>>   nLast = mn-1
>>>   a_2d  = onedtond( x_1d(nStrt:nLast), (/nlat,mlon/) )
>>>   nStrt = nLast+1
>>>   nLast = nStrt + mn - 1
>>>   b_2d  = onedtond( x_1d(nStrt:nLast), (/nlat,mlon/) )
>>>
>>> However, you should also know the metadata if you want
>>> to create a good netCDF file.
>>>
>>> If the bytes do not match, it is a Fortran sequential file ....
>>> you would have to determine the information from the extra bytes.
>>> David F Porter wrote:
>>>> Dave,
>>>>
>>>> Well, I'm trying to read in the Japanese Reanalysis data, of which
>>>> there is no documentation for how the binary was written. Most of
>>>> the data is in GRIB, but monthly means are in binary. I've tried
>>>> both direct and sequential access routines in NCL (attempting to
>>>> just read in the first record of the file), and neither worked. But
>>>> then again, I'm not positive of the order of the data in the files
>>>> either.
>>> Not sure what you mean ... "neither worked"
>>>>
>>>> To clarify, each file is for 1 time period (6-hourlies in GRIB,
>>>> monthlies in binary). Each file contains 10 variables. The problem
>>>> is, 4 variables are [lon,lat,lev] , one is the same but a different
>>>> number of levels, and then 5 are just [lon,lat]. To make things
>>>> more interesting, I'm not sure of the order that each variable was
>>>> written (it is slow communicating with the JMA).
>>> JMA should do a better job.
>>>>
>>>> I tried converting it to netCDF using IDL, which I've had success
>>>> doing with other binary files by just pointing to the starting byte
>>>> for each variable. I used the order of the variables in the
>>>> documentation (anl_mdl listed here
>>>> http://jra.kishou.go.jp/elements_en.html) and also the order of
>>>> variables given by the NCL function printFileVarSummary() after
>>>> reading in the corresponding 6-hourly GRIB files (the same
>>>> dimensions, just different time and format).
>>>>
>>>> Dave Porter
>>>>
>>>>
>>>>
>>>> On Jan 23, 2008, at 4:27 PM, Dave Allured wrote:
>>>>
>>>>> Dave,
>>>>>
>>>>> Can you be more specific as to the type of binary file? Fortran
>>>>> "unformatted sequential access"; plain binary such as written by
>>>>> Fortran direct access; or something else? I just want to be sure
>>>>> I'm on the right wavelength before responding.
>>>>>
>>>>> Also, assuming one of the first two: So the record length varies
>>>>> within each file? Does each file have its own unique layout, or is
>>>>> each variable found in exactly the same position in every file?
>>>>>
>>>>> Dave Allured
>>>>> CU/CIRES Climate Diagnostics Center (CDC)
>>>>> http://cires.colorado.edu/science/centers/cdc/
>>>>> NOAA/ESRL/PSD, Climate Analysis Branch (CAB)
>>>>> http://www.cdc.noaa.gov/
>>>>>
>>>>> David F Porter wrote:
>>>>>> Sorry if this has been covered, but I've exhausted the search
>>>>>> function with no real results.
>>>>>> I am looking to read in some large 4-Byte Float big-endian binary
>>>>>> data onto my little-endian machine. The problem I am having is
>>>>>> that each file corresponds to ONE time period, but each variable
>>>>>> in the file has different dimensions, some 2D and some 3D.
>>>>>> Because of the varying sizes, I feel that I cannot simply use the
>>>>>> "record number". Also, I only want some of the variables (to save
>>>>>> space after loading 300 of these files).
>>>>>> I'm not sure if it matters at this point, but the variables are on
>>>>>> a gaussian grid.
>>>>>> Dave
>>>>> _______________________________________________
>>>>> ncl-talk mailing list
>>>>> ncl-talk_at_ucar.edu
>>>>> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
Received on Mon Feb 11 2008 - 17:41:01 MST

This archive was generated by hypermail 2.2.0 : Fri Feb 15 2008 - 17:17:57 MST