Re: Reading large grib2 files

From: David Brown <dbrown_at_nyahnyahspammersnyahnyah>
Date: Fri, 14 Aug 2009 14:52:34 -0600

Hi Andrea,

Although NCL's GRIB1 reader can handle files greater than 2GB, the
GRIB2 reader cannot. The reason is that the GRIB2 reader uses the
NCEP library g2clib to do the low-level I/O on the GRIB data. The
NCEP-provided utility wgrib2 has the same limitation. Although we
could modify the library to support 64-bit reads, we are reluctant
to make changes to external libraries such as this one, because
doing so makes it harder to adopt newer versions of the library when
they are released. However, we may need to reconsider now that there
is evidence of archived GRIB data that exceeds the 32-bit boundary.

In the meantime, a workaround is to split these files into two parts,
neither of which exceeds the limit. I managed to do this successfully
with the file you reference below using the Unix 'split' command. The
trick is to make the split between records. If you have the wgrib2
tool (http://www.cpc.ncep.noaa.gov/products/wesley/wgrib2/) it is
pretty straightforward. Although wgrib2 cannot decode the file past
the 2GB boundary, it still works fine for records within the limit
because it just decodes the records as they appear. Running it
without command line options on this file produces output like this:

wgrib2 GFS_Global_0p5deg_20081108_1200.grib2

1:0:d=2008110812:TMP:10 mb:anl:
2.1:57012:d=2008110812:UGRD:10 mb:anl:
2.2:57012:d=2008110812:VGRD:10 mb:anl:
3.1:192581:d=2008110812:UGRD:20 mb:anl:
3.2:192581:d=2008110812:VGRD:20 mb:anl:
4.1:456199:d=2008110812:UGRD:30 mb:anl:
4.2:456199:d=2008110812:VGRD:30 mb:anl:
5.1:716851:d=2008110812:UGRD:50 mb:anl:
5.2:716851:d=2008110812:VGRD:50 mb:anl:
6.1:990684:d=2008110812:UGRD:70 mb:anl:
....
12657:2146660084:d=2008110812:ABSV:250 mb:129 hour fcst:
12658:2146794841:d=2008110812:TMP:500 mb:129 hour fcst:
12659:2146867791:d=2008110812:VVEL:450 mb:129 hour fcst:
12660:2147181598:d=2008110812:CLWMR:450 mb:129 hour fcst:
12661.1:2147239565:d=2008110812:UGRD:500 mb:129 hour fcst:
12661.2:2147239565:d=2008110812:VGRD:500 mb:129 hour fcst:

This is where it stops because it has reached 2GB. The number before
the first colon is the record number (with field number after the
decimal point if it is a multi-field record). The second number is
the offset into the file of the beginning of the record. My first
attempt to use split:

split -b 2147239564 GFS_Global_0p5deg_20081108_1200.grib2

did not work. It produced a single file that was just a copy of the
original. I speculate that split wants to be able to create at least
two files of the specified size. Also, I had subtracted 1 from the
record's offset, but because offsets are 0-based that adjustment was
not correct. So I looked through the wgrib2 output for an offset that
was just a bit less than halfway through the file and tried again:

split -b 1513331682 GFS_Global_0p5deg_20081108_1200.grib2

This worked, producing 3 files called xaa, xab, and xac. xac was
small, and of course there is no guarantee that the second split
falls on a record boundary. But that is easily fixed by using cat to
append the third file to the second:

cat xac >> xab

Now xaa and xab are valid GRIB2 files that NCL can handle and they
contain all the records that were in the original file.

If you cannot get access to wgrib2, you could still find the record
boundaries using the fact that GRIB (1 and 2) records begin with the
characters 'GRIB' and end with the characters '7777', so record
boundaries typically have the characters '7777GRIB' together. On my
Mac, I can track these down with the following Unix command:

od -Ad -c GFS_Global_0p5deg_20081108_1200.grib2 | grep '7 7 G R'

producing output like this:
...
156331280    k    W  242  033  213  002  207  377  331    7    7    7    7    G    R    I
156543888    G  031  354  301  245    g  346  177  377  331    7    7    7    7    G    R
157042784    F  277  377  331    7    7    7    7    G    R    I    B   \0   \0   \0  002
157196400  377  331    7    7    7    7    G    R    I    B   \0   \0   \0  002   \0   \0
157420336  336  177  377  331    7    7    7    7    G    R    I    B   \0   \0   \0  002
157578352   \r    x  253    F  026   \t    k  177  377  331    7    7    7    7    G    R

where the number at the beginning is the offset of the first
character in each line. Note that od -c places each byte into 4
spaces. This method seems substantially slower than using wgrib2 though.

This is probably more than you want to know but I hope it helps.
  -dave

On Aug 14, 2009, at 8:22 AM, Andrea Hahmann wrote:

> Hi Dennis and Dave
>
> Here is an example:
>
> file: /DSS/ds335.0/GFS0p5/GFS_Global_0p5deg_20081108_1200.grib2
>
> I already have a local copy on disk at gale:/ptmp/hahmann/GFS/
> 20081108/
>
> BTW, the script that is supposed to do this is called
> /ptmp/hahmann/GFS/readUV.ncl
>
> Thanks,
> Andrea
> --
>
> Andrea N. Hahmann
>
> Wind Energy Division
> Risø DTU
>
> Technical University of Denmark – DTU
> Risø National Laboratory for Sustainable Energy
> Frederikborgvej 399, P.O. Box 49
> Building 125
> DK-4000 Roskilde, Denmark
>
> Direct +45 4677 5471
> Fax +45 4677 5083
> ahah_at_risoe.dk
> www.risoe.dtu.dk
>
>
>
>> From: Dennis Shea <shea_at_ucar.edu>
>> Date: Fri, 14 Aug 2009 07:44:48 -0600
>> To: "Andrea N. Hahmann" <ahah_at_risoe.dtu.dk>
>> Cc: David Brown <dbrown_at_ucar.edu>
>> Subject: Re: Reading large grib2 files
>>
>> Hi Andrea,
>>
>> This is offline.
>>
>> Dave Brown will have to look into this.
>> He is the developer who does all the 'grib stuff'.
>>
>> Can you tell us what directory the GRIB data are located?
>>
>> Is there a specific example of a file that "hangs"?
>>
>> Cheers
>> D
>>
>>
>> Andrea Hahmann wrote:
>>> Dear NCLelers,
>>>
>>> I have been running into trouble when reading large (> 2GB)
>>> grib2 files
>>> with NCL. The data comes from the 0.5 x 0.5 degree GFS model output
>>> from the NCAR MSS (ds335.0/GFS0p5). The NCL script reads a single file,
>>> extracts the analysis (single time) 0.995 sigma level U and V winds,
>>> computes the wind speed, and writes a single NetCDF file
>>> containing the
>>> resulting output. Most times, the script takes 2-3 minutes to
>>> execute.
>>> Other times it just hangs without returning any errors. Any
>>> ideas of
>>> what I can do to improve the performance? All computations are
>>> done on
>>> the CISL data processing servers gale, gust, or breeze.
>>>
>>> Thanks in advance,
>>> Andrea
>>> --
>>>
>>> *Andrea N. Hahmann
>>> *
>>> Wind Energy Division
>>> Risø DTU
>>>
>>> *Technical University of Denmark – DTU
>>> *Risø National Laboratory for Sustainable Energy
>>> Frederikborgvej 399, P.O. Box 49
>>> Building 125
>>> DK-4000 Roskilde, Denmark
>>>
>>> Direct +45 4677 5471
>>> Fax +45 4677 5083
>>> ahah_at_risoe.dk
>>> www.risoe.dtu.dk
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> ncl-talk mailing list
>>> List instructions, subscriber options, unsubscribe:
>>> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
>>
>

_______________________________________________
ncl-talk mailing list
List instructions, subscriber options, unsubscribe:
http://mailman.ucar.edu/mailman/listinfo/ncl-talk
Received on Fri Aug 14 2009 - 14:52:34 MDT
