Re: different size netCDF files generated from NCL on different machines

From: Dave Allured - NOAA Affiliate <dave.allured_at_nyahnyahspammersnyahnyah>
Date: Thu Jun 19 2014 - 17:51:56 MDT

Marc,

Never mind, you already gave the dimensions and I missed it. Also, I was
overthinking the aspect of fitting chunks into arrays, and so on. I do not
think this has much to do with file sizes, except for chunks that are much
too small. That is not your current situation.

A different aspect of deciding chunk sizes is optimizing read access time.
NetCDF files are usually written once, then read many times by various
users.

In cases where the most common access is reading whole grids, it is my
opinion that chunking single complete grids is optimal. In your case that
would be (1, nlats, nlons), i.e. (1, 106, 123). I would suggest trying
this and seeing whether the resulting file size is acceptable.
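
A minimal NCL sketch of this suggestion, assuming a NetCDF-4 output file.
The file name "out.nc" is a placeholder, and the dimension and variable
names are taken from the T2 example quoted later in the thread, not from
Marc's actual script:

```ncl
; Sketch: chunk so that each chunk holds one complete (lat, lon) grid.
; "out.nc" and "T2" are placeholders, not from the original script.
setfileoption("nc", "Format", "NetCDF4Classic")
fout = addfile("out.nc", "c")

; File-wide chunking for the dimensions (Time, south_north, west_east),
; assuming Time is the unlimited dimension:
filechunkdimdef(fout, (/"Time", "south_north", "west_east"/), \
                (/1, 106, 123/), (/True, False, False/))

; Or per variable, after the variable has been defined in the file:
filevarchunkdef(fout, "T2", (/1, 106, 123/))
```

Either call should override whatever default chunk sizes the NCL/netCDF
library would otherwise pick.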

--Dave


On Thu, Jun 19, 2014 at 4:58 PM, Dave Allured - NOAA Affiliate <
dave.allured@noaa.gov> wrote:

> Marc,
>
> You should also look into the possibility of bad fit of the default chunk
> sizes. I am suspicious because of the excessive discrepancy, 5.3 GB versus
> 9.0 GB. Simple "tuning" should not create a difference this large.
>
> What are your actual time, lat, and lon dimension sizes in this file?
>
> --Dave
>
>
> On Thu, Jun 19, 2014 at 4:22 PM, Wei Huang <huangwei@ucar.edu> wrote:
>
>> Marc,
>>
>> As you mentioned in your email, filechunkdimdef is how to define the
>> chunk size (file-wise). See:
>> http://www.ncl.ucar.edu/Document/Functions/Built-in/filechunkdimdef.shtml
>> filevarchunkdef defines the chunk size for a variable, see:
>> http://www.ncl.ucar.edu/Document/Functions/Built-in/filevarchunkdef.shtml
>>
>> Another big factor is compression. For variable compression see:
>>
>> http://www.ncl.ucar.edu/Document/Functions/Built-in/filevarcompressleveldef.shtml
>> For file-wise compression (Shuffle can play a role as well), see:
>> http://www.ncl.ucar.edu/Document/Functions/Built-in/setfileoption.shtml
>> (go to CompressionLevel and Shuffle)
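
The options above can be sketched in NCL as follows; the file and variable
names are placeholders, and the file options must be set before the file is
created:

```ncl
; Sketch of file-wide compression settings via setfileoption.
setfileoption("nc", "Format", "NetCDF4Classic")
setfileoption("nc", "CompressionLevel", 5)   ; deflate level, 0-9
setfileoption("nc", "Shuffle", True)         ; byte shuffle before deflate
fout = addfile("out.nc", "c")

; Or set the compression level per variable instead:
; filevarcompressleveldef(fout, (/"T2"/), (/5/))
```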
>>
>> Wei
>>
>>
>> On Thu, Jun 19, 2014 at 2:55 PM, Marcella, Marc <
>> MMarcella@air-worldwide.com> wrote:
>>
>>> Thanks Wei. I can't seem to get specifying the actual chunk size to
>>> work via filevarchunkdef or filechunkdimdef. My file dimensions are
>>> time, lat, lon at 8760, 106, 123.
>>>
>>> For NCL 6.0 this makes a 5.3 GB file at chunk sizes 957, 21, 12, whereas on
>>> NCL 6.1.2 this makes a 9.0 GB file at chunk sizes 1752, 18, 25.
>>>
>>> I can't really seem to find a "scale factor" that is consistent across
>>> dimensions or NCL versions that converts from the dimension sizes to the
>>> chunks. Only the 8760 to 1752 seems to have a neat value of (1/5).
>>>
>>> Is there somewhere in the NCL code or libraries that I can specify/set
>>> the chunk sizes so I may see if this is indeed what is causing the
>>> differences between the two file sizes?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> -Marc
>>>
>>>
>>>
>>> *From:* Wei Huang [mailto:huangwei@ucar.edu]
>>> *Sent:* Thursday, June 19, 2014 4:48 PM
>>> *To:* Dave Allured - NOAA Affiliate
>>> *Cc:* Marcella, Marc; ncl-talk@ucar.edu
>>> *Subject:* Re: [ncl-talk] different size netCDF files generated from
>>> NCL on different machines
>>>
>>>
>>>
>>> Chunk size will definitely impact the file size, and compression too.
>>>
>>> If the chunk size is bigger than the real data size, the file will end up
>>> at the chunk size, as the file system has to allocate enough space
>>> for the specified chunk. If the chunk size is too small, the real data will
>>> be stored in many small chunks, which makes the file larger because of
>>> the overhead of storing the chunk info.
>>>
>>>
>>>
>>> This case helps all of us understand chunking better.
>>>
>>>
>>>
>>> Regards,
>>>
>>>
>>>
>>> Wei
>>>
>>>
>>>
>>> On Thu, Jun 19, 2014 at 1:47 PM, Dave Allured - NOAA Affiliate <
>>> dave.allured@noaa.gov> wrote:
>>>
>>> Marc,
>>>
>>>
>>>
>>> Please include the user list in all replies.
>>>
>>>
>>> I am glad you found the discrepancy. When first creating a Netcdf-4
>>> file in NCL, the user may optionally set chunk sizes with the function
>>> filevarchunkdef.
>>>
>>>
>>>
>>> If you don't set your own chunk sizes, I think NCL has a built-in method
>>> to compute default chunk sizes. That might explain your current results.
>>> Perhaps NCL support could explain this part.
>>>
>>>
>>>
>>> --Dave
>>>
>>> On Thu, Jun 19, 2014 at 12:18 PM, Marcella, Marc <
>>> MMarcella@air-worldwide.com> wrote:
>>>
>>> Hi Dave,
>>>
>>>
>>> Thank you for the email back. I was about to reply when I did
>>> find, thanks to your help, a difference after ncdump -hs: the chunk sizes.
>>>
>>>
>>>
>>> On the machine with the smaller file size, all of the variables read like
>>> this example one, T2:
>>>
>>> float T2(Time, south_north, west_east) ;
>>>
>>> T2:_FillValue = 9.96921e+36f ;
>>>
>>> T2:units = "K" ;
>>>
>>> T2:description = "Temperature at 2 m" ;
>>>
>>> T2:_Storage = "chunked" ;
>>>
>>> T2:_ChunkSizes = 957, 21, 12 ;
>>>
>>> T2:_DeflateLevel = 5 ;
>>>
>>> T2:_Shuffle = "true" ;
>>>
>>>
>>>
>>> But on the machine with the larger file size, the same variable (and all
>>> others) has a larger chunk size (957, 21, 12 vs. 1752, 18, 25):
>>>
>>> float T2(Time, south_north, west_east) ;
>>>
>>> T2:_FillValue = 9.96921e+36f ;
>>>
>>> T2:units = "K" ;
>>>
>>> T2:description = "Temperature at 2 m" ;
>>>
>>> T2:_Storage = "chunked" ;
>>>
>>> T2:_ChunkSizes = 1752, 18, 25 ;
>>>
>>> T2:_DeflateLevel = 5 ;
>>>
>>> T2:_Shuffle = "true" ;
>>>
>>>
>>>
>>> I'm assuming this is the problem. I started to read up on chunk sizes but
>>> thought I would email you back to see if you think this would in effect be
>>> causing the difference. And it seems like I can simply specify the
>>> chunk size somewhere (guessing it's in the netCDF libraries). Anyhow, I
>>> wanted to get your input on it. Thanks again for your help with this;
>>> please do let me know!
>>>
>>>
>>>
>>> -Marc
>>>
>>>
>>>
>>>
>>>
>>> Marc,
>>>
>>>
>>>
>>> An important clue is where you said you got different results when run on
>>> the command line versus a shell script. You might be launching a different
>>> NCL version through the shell script. Try ncl -V (display version number)
>>> on the command line and in the script.
>>>
>>>
>>>
>>> Look for other differences. First compare ncdump -k between differing
>>> files, just in case your format assumption is wrong. Then compare ncdump
>>> -hs, line by line. In particular, compare dimension sizes and compression
>>> parameters. Also look for missing variables.
>>>
>>>
>>>
>>> Check for unexpected program abort for the process that makes the
>>> smaller file. It is possible that you have a partially written file that
>>> is still valid NetCDF-4/HDF5, but with partially empty data. Such a file
>>> could possibly match in terms of variables and dimension sizes, but be
>>> physically smaller.
>>>
>>>
>>>
>>> --Dave
>>>
>>>
>>>
>>> On Wed, Jun 18, 2014 at 2:45 PM, Marcella, Marc <
>>> MMarcella@air-worldwide.com> wrote:
>>>
>>> Hi all,
>>>
>>>
>>>
>>> I am finding a peculiar instance in which, when an identical netCDF file is
>>> written via NCL (NetCDF-4 Classic, compression level 5) on one machine, the
>>> file is 5.6 GB, but on the other computer its size is 9 GB. To the best of
>>> my knowledge, the files are identical when opened (type, values, etc.).
>>> What is odder is that if I execute the NCL script (and ensure they are
>>> using the same version of NCL) from the command line instead of from the
>>> shell script, the file sizes are identical.
>>>
>>>
>>>
>>> Are there certain checks I am missing that I should try? I believe both
>>> machines are using the same netCDF libraries, but could that be the issue?
>>> Anyhow, I am somewhat stumped and believe it must be a machine-environment
>>> difference when the two machines execute the shell script that calls the
>>> NCL script to generate the netCDF file. Any help you could lend would be
>>> greatly appreciated… thanks!
>>>
>>>
>>>
>>> -Marc
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> ncl-talk mailing list
>>> List instructions, subscriber options, unsubscribe:
>>> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
>>>
>>>
>>>
>>
>>
>

Received on Thu Jun 19 11:52:09 2014

This archive was generated by hypermail 2.1.8 : Wed Jul 23 2014 - 15:33:46 MDT