Re: large netCDF file writing

From: David Ian Brown <dbrown_at_nyahnyahspammersnyahnyah>
Date: Wed, 18 Oct 2006 12:11:56 -0600

Comments interspersed below.
On Oct 18, 2006, at 10:02 AM, Josh Hacker wrote:

> Dennis,
>
> You are pretty much on (clarified below). Although it is a crazy thing
> to try and deal with files this big, I did not see a way around it for
> my particular purpose. I have since tried something different that
> gives me the efficiency I need.
>
> Of course the trade-off is always computations versus storage, and I
> was opting for the storage in this case in the hopes of minimizing
> future computations.
>
> The ultimate goal is to write "patched" WRF restart files (one per
> processor) for thousands of processors. The approach is needed to
> avoid the memory footprint associated with a single processor
> allocating a huge domain.
>
> This requires assembling part or all of a variable from some patched
> files, say 1024, and then writing them out in a different
> configuration, say 4096. The assembling takes quite a while, so I
> wanted to do it only once and write to a big file. I can then read
> from the big file and rip it up accordingly. Instead, I settled on
> binary write/reads to/from individual files, each containing a single
> variable array. I should have thought of this a while ago but for
> whatever reason...
>
> The netCDF operator ncks has a possible bug on 64-bit machines that
> results in a seg fault when a hyperslab is asked for, so I couldn't
> use it.
>
> Ultimately, parallel hdf5 is the solution to the problem because then
> the WRF can deal with only one file while not using too much memory,
> but it is not yet ported to the platform I am trying to use.
>
> If it is still useful, see below for clarification...
>
> Dennis Shea wrote:
>>> I want to write arrays to a large file without waiting until next
>>> week.
>>
>>
>> ===
>> Of course, as the Unidata people themselves say,
>> "netCDF software is designed for robustness and flexibility,
>> not efficiency"
>> ===
>>
>>
>>> Although this may be a crazy thing to do, I have a good reason to
>>> try.
>>> The file size is about 46G, and I am writing 2250x2250 2D arrays and
>>> 2250x2250x100 3D arrays.
>>>
>>> The file has already been created with ncgen. The dimensions and
>>> variables are thus pre-defined. No attributes are written, e.g.
>>> (/*/).
>>> Although I didn't believe most of these would work, I have tried the
>>> following with no noticeable difference:
>>>
>>> 1. Reversing the order of my variable list, which I am looping
>>> through as I write each variable.
>>
>>
>> ===
>> If the file is "predefined", then the underlying Unidata
>> software is [I speculate] using 'fseek' to go to a specific
>> predefined file location. It seems to me the write order
>> should not matter [much].
>> ===
>>
>>> 2. Setting the file option "PreFill" to false.
>>>
>>
>> ===
>> Yes, this should speed the file creation process.
>> However, has this already been done in the "ncgen -x" process?
>> ===
>>
>>
>>> 3. Copying my output array, which is a subset of an internal NCL
>>> array, to a temporary variable for write.
>>>
>>> Thoughts, ideas?
>>>
>>> versions and platform:
>>> ---------------------
>>> NCAR Command Language Version 4.2.0.a033
>>> netCDF 3.6.0
>>> uname -m: x86_64
>>> ---------------------
>>
>>
>> I am sure I am not understanding this correctly. Pls clarify.
>> My understanding is:
>>
>> [1] You have used "ncgen" to create a file template based
>> on a CDL file. Using "ncgen -x" [ie: no prefill] results
>> in much faster file creation.
>>
>> [a] Is this done independent of NCL? Eg, Invoking ncgen
>> from the command line?
>>
>> ncgen -x .....
>>
>
>
> Command line.
>
>
>>
>> [b] You are invoking ncgen from within an NCL script
>> via the 'system' command. Hence, components
>> of the ascii CDL file used by ncgen are generated within
>> an NCL script.
>>
>> [2] You are then opening the file template created in [1] in NCL
>> with the "w" option on addfile
>>
>> fout = addfile("ncgen_template_file.nc", "w")
>>
>> I speculate this will take time to open.
>> NCL is just invoking the underlying Unidata software.
>
>
> Yes, although the open does not seem to take long.

Right. The header information is all at the front of the file, so the
size of the file does not affect how long NCL takes to read and cache all
the information about dimensions, variable names and sizes, and attribute
names and values.

>
>
>>
>> [3] Now you want to write to the file:
>>
>> fout->A(nl,:,:) = (/ a /) ; a(2250,2250)
>>
>> or maybe
>>
>> fout->B = (/ b /) ; b(100,2250,2250)
>>
>
> You got it.

I don't know if this applies here, but one potentially troublesome and
perhaps not well-known point about using the '(/ var /)' assignment is
that, while it generally means that attributes are not copied, the
'_FillValue' attribute is an exception (necessary if you think about it)
and is copied. If the variables in your originally defined file lack the
_FillValue attribute, this could cause a huge loss of performance, because
all the previously defined data is moved to make room for a new attribute
in the header at the beginning of the file. This may occur whether or not
you originally defined the variables with 'PreFill' set to False (or the
equivalent using ncgen or any other tool): when the file is read, NCL and
the NetCDF library can make no assumptions about whether the data is valid
or not.
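
A minimal sketch of how one might check for this situation before writing,
assuming the file variable is named "A" and the in-memory array is `a`, as
in the example quoted above. The array is created with new() here only to
keep the sketch self-contained; new() attaches a default _FillValue, which
is one common way the attribute ends up on an in-memory variable.

```ncl
; Sketch only: file name and variable name "A" follow the quoted example.
fout = addfile("ncgen_template_file.nc", "w")

; Stand-in for the real data; new() adds a default _FillValue.
a = new((/2250,2250/), float)

in_file   = isfilevaratt(fout, "A", "_FillValue")  ; already in the file header?
in_memory = isatt(a, "_FillValue")                 ; present on the array to write?

if (in_memory .and. (.not. in_file)) then
  print("Writing (/ a /) would add _FillValue to A and enlarge the header")
end if
```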

You can get around this by
(1) making sure all variables in the file have a _FillValue attribute to
begin with,
(2) explicitly deleting the _FillValue attribute from your in-memory
variables if you really don't want them, or
(3) using the 'HeaderReserveSpace' option to create enough extra room in
the header to accommodate the _FillValue attributes for each variable
without requiring the data to be moved.
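
Options (2) and (3) might look roughly like the following in an NCL script.
This is a sketch under assumptions: the file name, the variable name "A",
the index `nl`, and the array `a` are taken from the example quoted above,
and the 64000-byte reservation is an arbitrary placeholder, not a
recommended value. It also assumes 'HeaderReserveSpace' takes effect for a
file opened with "w"; it is likely most useful when set before the file is
first created. Option (1) belongs in the CDL given to ncgen and is not
shown.

```ncl
; Sketch only: names follow the quoted example; values are placeholders.
; nl and a are assumed to be defined earlier in the script.

; (3) Reserve extra header space. File options must be set before addfile.
setfileoption("nc", "HeaderReserveSpace", 64000)  ; bytes, arbitrary choice
setfileoption("nc", "PreFill", False)

fout = addfile("ncgen_template_file.nc", "w")

; (2) Remove _FillValue from the in-memory array so that the write
; does not add a new attribute to the file header.
if (isatt(a, "_FillValue")) then
  delete(a@_FillValue)
end if

fout->A(nl,:,:) = (/ a /)
```

Either measure on its own should be enough to avoid the data move; they are
shown together only for brevity.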

However, if the file-writing still seems slow, my guess is you are just
encountering the fundamental speed limits of writing data using the NetCDF
library. If NCL is much slower than some other app that uses the NetCDF
library, then I'd be interested in investigating further.
  -dave

>
> As always, thanks for the reply.
>
> Josh
>
> --
> Joshua Hacker
> Research Applications Laboratory
> NCAR/UCAR
> email: hacker_at_ucar.edu
> voice: 303-497-8188
> fax: 303-497-8401
>
> _______________________________________________
> ncl-talk mailing list
> ncl-talk_at_ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/ncl-talk

Received on Wed Oct 18 2006 - 12:11:56 MDT
