Re: Script crashes while computing WRF diagnostics (SLP, MDBZ, TD2)

From: Mary Haley <haley_at_nyahnyahspammersnyahnyah>
Date: Fri Jan 21 2011 - 15:50:03 MST

Hi Joe,

What version of NCL are you running?

There could be some internal memory being grabbed that you won't see via list_vars. As you alluded to below,
it is likely one of the wrf_xxx functions, as they have to internally create a copy of your input arrays to convert
them to double precision (since they come in as float). Also, in order to calculate "slp", you have to grab roughly
6 variables off the file, all of which are multi-dimensional arrays of size 30 x 515 x 380. This may start to
add up, considering some of these have to be converted to double precision.
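To put rough numbers on it, here's a quick sketch of the arithmetic (in Python, just to illustrate — the count of 6 input arrays is approximate, and NCL's internal allocation pattern will differ):

```python
# Back-of-the-envelope memory cost of computing "slp" for one time step:
# roughly 6 input arrays of 30 x 515 x 380, each read as 4-byte floats and
# then copied internally to 8-byte doubles.
nz, ny, nx = 30, 515, 380
n_inputs = 6  # approximate number of 3D variables grabbed off the file

float_mb  = n_inputs * nz * ny * nx * 4 / 2.0**20  # original float arrays
double_mb = n_inputs * nz * ny * nx * 8 / 2.0**20  # double-precision copies

print("floats: %.0f MB, double copies: %.0f MB" % (float_mb, double_mb))
# -> floats: 134 MB, double copies: 269 MB
```

So even without a leak, a single diagnostic can transiently need several hundred MB per time step.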

However, these arrays should be getting freed up after you calculate each quantity.

Would you be able to provide us with your latest script and data files so we can try running it here? You can use our ftp
site (see http://www.ncl.ucar.edu/report_bug.shtml for ftp info), or put the files on your own site and we'll get them.

The actual WRF computations are done in various Fortran files in $NCARG/ni/src/lib/nfpfort. The wrf_slp one is
in wrf_user.f. I've attached the file here.

--Mary

On Jan 21, 2011, at 3:00 PM, Joseph Zambon wrote:

> Mary,
>
> I'm not sure how NCL reserves memory, but I think there may be a "leak" somewhere while the loop is iterating.
>
> When NCL starts, it reserves 5936kb of physical memory, 107mb of virtual memory...
> 6283 jbzambon 16 0 107m 5936 4060 S 0.0 0.1 0:00.02 ncl
>
> When I run the script to just before running the loop, it's using 170mb virtual / 69mb resident...
> 6283 jbzambon 15 0 170m 69m 5532 S 0.0 1.8 0:00.51 ncl
>
> When I execute the loop for 5 iterations, it's using 551mb virtual / 450mb resident...
> 6283 jbzambon 25 0 551m 450m 5840 S 0.0 11.4 0:58.14 ncl
>
> When I run a list_vars command, I get the following variables...
> ncl 52> list_vars
> string in_file [ 1 ]
> integer time [ 1 ]
> string var_names_time [ 1 ]
> integer south_north [ 1 ]
> string maxtime [ 673 ]
> _FillValue
> description
> integer DateStrLen [ 1 ]
> integer bottom_top [ 1 ]
> string dim_names [ 5 ]
> integer west_east [ 1 ]
> string dim_names_time [ 2 ]
> logical dimUnlim [ 5 ]
> integer b [ 3 ]
> float T [ bottom_top | 30 ] x [ south_north | 515 ] x [ west_east | 380 ]
> FieldType
> MemoryOrder
> description
> units
> stagger
> coordinates
> string dim_names_3D [ 3 ]
> string out_file [ 1 ]
> string var_types_3D [ 10 ]
> string var_types_time [ 1 ]
> string var_names_3D [ 10 ]
> integer dim_sizes [ 5 ]
> integer ntimes [ 1 ]
>
> The only substantial source of memory listed is T (30x515x380)...
> 30 x 515 x 380 = 5,871,000 floats * 4 bytes/float = 23,484,000 bytes / 1,048,576 bytes/MB = 22.39MB
>
> When I delete that float, I get the following memory usage; it's using 528mb/428mb...
> 6283 jbzambon 15 0 528m 428m 5844 S 0.0 10.8 0:58.17 ncl
> 551mb-528mb=23mb / 450mb-428mb=22mb, so this makes sense...
>
> Somewhere in the iteration of the loop, though, memory is being reserved that is not reported by list_vars, so I can't delete anything to reduce memory usage.
>
> I've tried using the last version of the code, as well as your suggestion to use -1 instead of running a loop. As you suggested, I replaced my loop with
> ncf->slp = wrf_user_getvar(a,"slp",-1) ;SLP
> When I monitor memory usage while using your suggestion, the amount of reserved memory fluctuates between 1.6gb and 2.8gb...
> 6351 jbzambon 18 0 2954m 714m 5496 R 50.0 18.1 0:04.54 ncl
> 6351 jbzambon 18 0 2954m 1.6g 5496 D 51.5 41.0 0:13.29 ncl
> 6351 jbzambon 18 0 2954m 1.9g 5496 D 61.8 49.9 0:16.73 ncl
> 6351 jbzambon 18 0 2954m 2.3g 5496 R 63.3 59.5 0:20.48 ncl
> 6351 jbzambon 18 0 2954m 2.5g 5496 R 67.7 63.8 0:22.52 ncl
> 6351 jbzambon 18 0 2954m 2.6g 5496 D 63.4 68.3 0:24.43 ncl
> 6351 jbzambon 17 0 2954m 2.2g 6760 D 10.3 58.0 0:30.15 ncl
> 6351 jbzambon 16 0 2954m 2.8g 6760 D 9.0 71.8 0:33.23 ncl
> 6351 jbzambon 16 0 2954m 2.8g 6760 D 9.0 71.8 0:33.23 ncl
>
> After about 35s of running, I get a segmentation fault....
> Segmentation fault (core dumped)
>
> If I try to open the netCDF file to see what frame it died on, I am unable to...
> ncdump -h /he_data/he/jbzambon/projects/feb10/OUT/ncl_out.nc
> ncdump: /he_data/he/jbzambon/projects/feb10/OUT/ncl_out.nc: Not a netCDF file
>
> I haven't yet worked on predefining the netCDF file; I'm guessing that has to do with I/O efficiency and wouldn't be the cause of this issue.
>
> Also, where are the functions which WRFUserARW.ncl calls located? For example:
> tk = wrf_tk( P , T ) ; calculate TK
> slp = wrf_slp( z, tk, P, QVAPOR ) ; calculate slp
> I'm guessing it's somewhere in the NCL source, as it's not in:
> load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/gsn_code.ncl"
> load "$NCARG_ROOT/lib/ncarg/nclscripts/wrf/WRFUserARW.ncl"
> load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/contributed.ncl"
> And I can't see anything in wrfW.c that does the actual computation.
>
> If you have any other ideas, please let me know. Thanks for your suggestions and help!
>
> -Joe
>
>
>
>
> On Jan 21, 2011, at 12:57 PM, Mary Haley wrote:
>
>> Hi Joseph,
>>
>> I'm guessing you are simply running out of memory. Once you are done with a do loop,
>> you might try deleting any variables you don't need before you proceed to the next do loop.
>>
>> Also, have you tried getting rid of the "do" loops and using "-1" as the last argument to
>> wrf_user_getvar? This allows you to read in all time steps for that variable without
>> having to loop.
>>
>> What I'm thinking about may look like your third attempt, except without the do loop:
>>
>>> ncf->slp = wrf_user_getvar(a,"slp",-1) ;SLP
>>> ncf->mdbz = wrf_user_getvar(a,"mdbz",-1) ;MDBZ
>>> ncf->td2 = wrf_user_getvar(a,"td2",-1) ;TD2
>>
>> If you predefine the NetCDF file that you're writing to, this will help your code be far more efficient,
>> especially if you are writing a lot of data to the file.
>>
>> Please see our examples page on this:
>>
>> http://www.ncl.ucar.edu/Applications/method_2.shtml
>>
>> Let me know if this doesn't help and we'll try something else.
>>
>> --Mary
>>
>> On Jan 20, 2011, at 2:35 PM, Joseph Zambon wrote:
>>
>>> NCL Users,
>>>
>>> I have a script that runs through a wrfout file and pulls out diagnostics as well as some variables I need to use. The variables are extracted without a problem; however, when I run this script for larger domains, the diagnostics appear to crash the system. I've been running this on a serial compute node, as I've crashed the login node a few times (to the aggravation of other users, I'm sure) while trying to debug this.
>>>
>>> Attached is my script as it is right now (ncl_out.ncl). I've tried a couple of different approaches to computing the diagnostics for a domain that is 379 (w_e) x 514 (s_n) with 673 timesteps. Using the diagnostics SLP and MDBZ as examples...
>>>
>>> Version 1:
>>> The original approach was to create an array (ar_3d), save SLP to the array at every time step, save the array to the netCDF file, then overwrite the array for the next variable (MDBZ). This way I was only using one large array (379x514x673) to pull out the variables. This script would crash after ~120 timesteps (not sure exactly where, as it doesn't save the array until the end and would not print out the times right before dying). This was probably the least efficient use of memory.
>>> *********************************************************************************
>>> ;create diagnostics array
>>> ar_3d = new( (/ ntimes, south_north, west_east /), float)
>>>
>>> ;Diagnostics
>>> do time = 0,ntimes-1,1                      ;SLP
>>>   slp = wrf_user_getvar(a,"slp",time)       ; Sea level pressure in hPa
>>>   ar_3d(time,:,:) = slp
>>>   print(time)
>>> end do
>>> ncf->slp = ar_3d
>>>
>>> do time = 0,ntimes-1,1                      ;MDBZ
>>>   mdbz = wrf_user_getvar(a,"mdbz",time)     ; Simulated radar refl.
>>>   ar_3d(time,:,:) = mdbz
>>>   print(time)
>>> end do
>>> ncf->mdbz = ar_3d
>>> *********************************************************************************
>>>
>>> Version 2:
>>> I assumed that this array was getting too large and was choking my memory. My next attempt was to save the output directly to the netCDF file and not use an array in the middle. This script would crash after 234 timesteps (the netCDF file wrote 234 frames and then died while processing the 235th).
>>> *********************************************************************************
>>> ;compute diagnostics and export to netCDF file
>>> do time = 0,ntimes-1,1                      ;SLP
>>>   ncf->slp(time,:,:) = (/wrf_user_getvar(a,"slp",time)/)
>>>   print(time)
>>> end do
>>>
>>> ;compute diagnostics and export to netCDF file
>>> do time = 0,ntimes-1,1                      ;MDBZ
>>>   ncf->mdbz(time,:,:) = (/wrf_user_getvar(a,"mdbz",time)/)
>>>   print(time)
>>> end do
>>> *********************************************************************************
>>>
>>> Version 3:
>>> My last attempt was to bunch all of the diagnostics within the same time loop. I'm not sure how efficient this is for I/O; it didn't seem to take much longer, and it saved all 3 diagnostic variables to the netCDF file. This one stopped at 241 timesteps.
>>> *********************************************************************************
>>> ;compute diagnostics and export to netCDF file
>>> do time = 0,ntimes-1,1
>>>   ncf->slp(time,:,:)  = (/wrf_user_getvar(a,"slp",time)/)   ;SLP
>>>   ncf->mdbz(time,:,:) = (/wrf_user_getvar(a,"mdbz",time)/)  ;MDBZ
>>>   ncf->td2(time,:,:)  = (/wrf_user_getvar(a,"td2",time)/)   ;TD2
>>>   print(time)
>>> end do
>>> *********************************************************************************
>>>
>>> If anyone has any advice on why my script might be crashing, it would be greatly appreciated. I'm chasing it down the path of memory allocation (should be a max of 4Gb per node), although I may be way off...
>>>
>>> CPU time : 1882.70 sec.
>>> Max Memory : 3256 MB
>>> Max Swap : 11190 MB
>>>
>>> Max Processes : 10
>>> Max Threads : 12
>>>
>>> Also attached is the text outfile (ncl.out) from my last analysis, they're all more or less the same (except the length of analysis). As shown in the outfile, I'm using NCL v. 5.2.0.
>>>
>>> Of course, any general critiques on how to make this NCL script more efficient would also be greatly appreciated.
>>>
>>> Please let me know if I neglected to mention anything important. Thanks!
>>>
>>> -Joe
>>>
>>>
>>>
>>> Joseph B. Zambon
>>> jbzambon@ncsu.edu
>>> NC State University
>>> Department of Marine, Earth and Atmospheric Sciences
>>>
>>>
>>> <ncl_out.ncl.txt>
>>>
>>> <ncl.out>
>>>
>>> _______________________________________________
>>> ncl-talk mailing list
>>> List instructions, subscriber options, unsubscribe:
>>> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
>>
>

Received on Fri Jan 21 15:50:10 2011

This archive was generated by hypermail 2.1.8 : Tue Jan 25 2011 - 14:22:15 MST