Re: Script crashes while computing WRF diagnostics (SLP, MDBZ, TD2)

From: Mary Haley <haley_at_nyahnyahspammersnyahnyah>
Date: Mon Jan 24 2011 - 08:48:59 MST

On Jan 21, 2011, at 4:51 PM, Gus Correa wrote:

> Hi Mary and Joe (and Dennis, if he's around)
>
> Some time ago I suffered from a similar memory failure/leak,
> where my NCL scripts would progressively slow down,
> and take up all memory.
> This was with NCL 5.1.1.
>
> Back then the problem was traced to printing messages
> from inside a loop.
> Dennis pointed out that memory allocated to hold the message strings
> was not being freed by NCL (if I remember right).
> Mary and Dennis may remember this.
>
> Once the print commands were commented out, the problem disappeared.
>
> Could this be Joe's problem?

It's certainly possible. It depends on the nature of the strings being used
and how many of them there are. That situation has been improved a little,
but there is still an issue.

If you are doing a lot of looping, and printing out strings inside the loop, this
can quickly add up to a problem.
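If the prints are wanted, one workaround in the meantime is to print only
occasionally and to delete per-iteration temporaries explicitly. A minimal
sketch, assuming the variable names (`a`, `ncf`, `ntimes`) from the scripts
quoted below:

```ncl
; Sketch only: free the temporary each pass and print sparsely.
do time = 0, ntimes-1
  slp = wrf_user_getvar(a, "slp", time)  ; one time step at a time
  ncf->slp(time,:,:) = (/slp/)
  delete(slp)                            ; release the temporary copy
  if (mod(time, 50) .eq. 0) then         ; progress report every 50 steps
    print(time)
  end if
end do
```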

--Mary

>
> My two cents,
> Gus Correa
>
> Mary Haley wrote:
>> Hi Joe,
>>
>> What version of NCL are you running?
>>
>> There could be some internal memory being grabbed that you won't see via
>> list_vars. As you alluded to below,
>> it is likely one of the wrf_xxx functions, as they have to internally
>> create a copy of your input arrays to convert
>> them to double precision (since they come in as float). Also, in order
>> to calculate "slp", you have to grab roughly
>> 6 variables off the file, all of which are multi-dimensional arrays of
>> size 30 x 515 x 380. This may start to
>> add up considering you have to convert some of these to double precision.
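For a rough sense of the scale involved, here is a back-of-envelope estimate
(my arithmetic, using the 30 x 515 x 380 size above and assuming ~6 fields
held in both precisions):

```ncl
; Rough estimate only; the exact number of internal copies is an assumption.
n = 30 * 515 * 380                  ; 5,871,000 values per 3-D field
mb_float  = n * 4 / (1024.*1024.)   ; ~22.4 MB per float copy
mb_double = n * 8 / (1024.*1024.)   ; ~44.8 MB per double copy
print(6 * (mb_float + mb_double))   ; ~400 MB in flight for one "slp" call
```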
>>
>> However, these arrays should be getting freed up after you calculate
>> each quantity.
>>
>> Would you be able to provide us with your latest script and data files
>> so we can try running it here? You can use our ftp
>> site (see http://www.ncl.ucar.edu/report_bug.shtml for ftp info), or put
>> the files on your own site and we'll get them.
>>
>> The actual WRF computations are done in various Fortran files in
>> $NCARG/ni/src/lib/nfpfort. The wrf_slp one is
>> in wrf_user.f. I've attached the file here.
>>
>>
>> --Mary
>>
>>
>> On Jan 21, 2011, at 3:00 PM, Joseph Zambon wrote:
>>
>>> Mary,
>>>
>>> I'm not sure how NCL reserves memory, but I think there may be a
>>> "leak" somewhere while the loop is iterating.
>>>
>>> When NCL starts, it reserves 5936kb of physical memory, 107mb of
>>> virtual memory...
>>> 6283 jbzambon 16 0 107m 5936 4060 S 0.0 0.1 0:00.02 ncl
>>>
>>> When I run the script up to just before the loop, it's using
>>> 170mb/69mb...
>>> 6283 jbzambon 15 0 170m 69m 5532 S 0.0 1.8 0:00.51 ncl
>>>
>>> When I execute the loop for 5 iterations, it's using 551mb/450mb...
>>> 6283 jbzambon 25 0 551m 450m 5840 S 0.0 11.4 0:58.14 ncl
>>>
>>> When I run a list_vars command, I get the following variables...
>>> ncl 52> list_vars
>>> string in_file [ 1 ]
>>> integer time [ 1 ]
>>> string var_names_time [ 1 ]
>>> integer south_north [ 1 ]
>>> string maxtime [ 673 ]
>>> _FillValue
>>> description
>>> integer DateStrLen [ 1 ]
>>> integer bottom_top [ 1 ]
>>> string dim_names [ 5 ]
>>> integer west_east [ 1 ]
>>> string dim_names_time [ 2 ]
>>> logical dimUnlim [ 5 ]
>>> integer b [ 3 ]
>>> float T [ bottom_top | 30 ] x [ south_north | 515 ] x [ west_east | 380 ]
>>> FieldType
>>> MemoryOrder
>>> description
>>> units
>>> stagger
>>> coordinates
>>> string dim_names_3D [ 3 ]
>>> string out_file [ 1 ]
>>> string var_types_3D [ 10 ]
>>> string var_types_time [ 1 ]
>>> string var_names_3D [ 10 ]
>>> integer dim_sizes [ 5 ]
>>> integer ntimes [ 1 ]
>>>
>>> The only substantial source of memory listed is T (30x515x380)...
>>> 30 x 515 x 380 = 5,871,000 floating-point values * 32 bits/float =
>>> 187,872,000 bits / 8,388,608 bits/MB = 22.39 MB
>>>
>>> When I delete that float, I get the following memory usage; it's now
>>> using 528mb/428mb...
>>> 6283 jbzambon 15 0 528m 428m 5844 S 0.0 10.8 0:58.17 ncl
>>> 551mb-528mb=23mb / 450mb-428mb=22mb, so this makes sense...
>>>
>>> Somewhere in the iteration of the loop though, memory is being
>>> reserved that is not being reported by list_vars, so I can't delete
>>> anything in order to reduce memory usage.
>>>
>>> I've tried using the last version of the code, as well as your
>>> suggestion to use -1 instead of running a loop. I replaced my loop,
>>> as you suggested, with:
>>> ncf->slp = wrf_user_getvar(a,"slp",-1) ;SLP
>>> When I monitor memory usage while using your suggestion, the amount
>>> of reserved memory fluctuates between 1.6gb and 2.8gb...
>>> 6351 jbzambon 18 0 2954m 714m 5496 R 50.0 18.1 0:04.54 ncl
>>> 6351 jbzambon 18 0 2954m 1.6g 5496 D 51.5 41.0 0:13.29 ncl
>>> 6351 jbzambon 18 0 2954m 1.9g 5496 D 61.8 49.9 0:16.73 ncl
>>> 6351 jbzambon 18 0 2954m 2.3g 5496 R 63.3 59.5 0:20.48 ncl
>>> 6351 jbzambon 18 0 2954m 2.5g 5496 R 67.7 63.8 0:22.52 ncl
>>> 6351 jbzambon 18 0 2954m 2.6g 5496 D 63.4 68.3 0:24.43 ncl
>>> 6351 jbzambon 17 0 2954m 2.2g 6760 D 10.3 58.0 0:30.15 ncl
>>> 6351 jbzambon 16 0 2954m 2.8g 6760 D 9.0 71.8 0:33.23 ncl
>>> 6351 jbzambon 16 0 2954m 2.8g 6760 D 9.0 71.8 0:33.23 ncl
>>>
>>> After about 35s of running, I get a segmentation fault....
>>> Segmentation fault (core dumped)
>>>
>>> If I try to open the netCDF file to see at what frame it died, I am unable to:
>>> ncdump -h /he_data/he/jbzambon/projects/feb10/OUT/ncl_out.nc
>>> ncdump: /he_data/he/jbzambon/projects/feb10/OUT/ncl_out.nc: Not a
>>> netCDF file
>>>
>>> I haven't yet worked on predefining the netCDF file; I'm guessing
>>> that has to do with I/O efficiency and wouldn't be the cause of this issue.
>>>
>>> Also, where are the functions which WRFUserARW.ncl calls located? For
>>> example:
>>> tk = wrf_tk( P , T ) ; calculate TK
>>> slp = wrf_slp( z, tk, P, QVAPOR ) ; calculate slp
>>> I'm guessing it's somewhere in the NCL source, as it's not in:
>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/gsn_code.ncl"
>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/wrf/WRFUserARW.ncl"
>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/contributed.ncl"
>>> And I can't see anything in wrfW.c that does the actual computation.
>>>
>>> If you have any other ideas, please let me know. Thanks for your
>>> suggestions and help!
>>>
>>> -Joe
>>>
>>>
>>>
>>>
>>> On Jan 21, 2011, at 12:57 PM, Mary Haley wrote:
>>>
>>>> Hi Joseph,
>>>>
>>>> I'm guessing you are simply running out of memory. Once you are done
>>>> with a do loop,
>>>> you might try deleting any variables you don't need before you
>>>> proceed to the next do loop.
>>>>
>>>> Also, have you tried getting rid of the "do" loops and using "-1" as
>>>> the last argument to
>>>> wrf_user_getvar? This allows you to read in all time steps for that
>>>> variable without
>>>> having to loop.
>>>>
>>>> What I'm thinking about may look like your third attempt, except
>>>> without the do loop:
>>>>
>>>>> ncf->slp = wrf_user_getvar(a,"slp",-1) ;SLP
>>>>> ncf->mdbz = wrf_user_getvar(a,"mdbz",-1) ;MDBZ
>>>>> ncf->td2 = wrf_user_getvar(a,"td2",-1) ;TD2
>>>>
>>>> If you predefine the NetCDF file that you're writing to, this will
>>>> help your code be far more efficient,
>>>> especially if you are writing a lot of data to the file.
>>>>
>>>> Please see our examples page on this:
>>>>
>>>> http://www.ncl.ucar.edu/Applications/method_2.shtml
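A minimal sketch of what that predefinition might look like here (dimension
names and sizes are assumed from this thread; see the linked page for the
full pattern):

```ncl
; Sketch only: define dimensions/variables once, then write one slice per step.
setfileoption(ncf, "DefineMode", True)
dnames = (/ "Time", "south_north", "west_east" /)
dsizes = (/ -1, 515, 380 /)               ; -1 marks Time as unlimited
dunlim = (/ True, False, False /)
filedimdef(ncf, dnames, dsizes, dunlim)
filevardef(ncf, "slp",  "float", dnames)
filevardef(ncf, "mdbz", "float", dnames)
filevardef(ncf, "td2",  "float", dnames)
setfileoption(ncf, "DefineMode", False)
```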
>>>>
>>>> Let me know if this doesn't help and we'll try something else.
>>>>
>>>> --Mary
>>>>
>>>> On Jan 20, 2011, at 2:35 PM, Joseph Zambon wrote:
>>>>
>>>>> NCL Users,
>>>>>
>>>>> I have a script that runs through a wrfout file and pulls out
>>>>> diagnostics as well as some variables I need to use. The variables
>>>>> are extracted without a problem; however, when I run this script for
>>>>> larger domains, the diagnostics appear to crash the system. I've
>>>>> been running this on a serial compute node, as I've crashed the login
>>>>> node a few times (to the aggravation of other users, I'm sure) while
>>>>> trying to debug this.
>>>>>
>>>>> Attached is my script as it is right now (ncl_out.ncl). I've tried
>>>>> a couple of different approaches to running the diagnostics for a
>>>>> domain that is 379 (w_e) x 514 (s_n) with 673 timesteps, using the
>>>>> diagnostics SLP and MDBZ as examples...
>>>>>
>>>>> Version 1:
>>>>> My original approach was to create an array (ar_3d), save SLP to it at
>>>>> every time step, save the array to the netCDF file, then overwrite the
>>>>> array for the next variable (MDBZ). This way I was only using one large
>>>>> array (379x514x673) to pull out the variables. This script would
>>>>> crash after ~120 timesteps (not sure exactly where, as it doesn't
>>>>> save the array until the end and would not print out the times right
>>>>> before dying). This was probably the least efficient use of memory.
>>>>> *********************************************************************************
>>>>> ;create diagnostics array
>>>>> ar_3d = new( (/ ntimes, south_north, west_east /), float)
>>>>>
>>>>> ;Diagnostics
>>>>> do time = 0,ntimes-1,1 ;SLP
>>>>> slp = wrf_user_getvar(a,"slp",time) ; Sea level pressure in hPa
>>>>> ar_3d(time,:,:)=slp
>>>>> print(time)
>>>>> end do
>>>>> ncf->slp = ar_3d
>>>>>
>>>>> do time = 0,ntimes-1,1 ;MDBZ
>>>>> mdbz = wrf_user_getvar(a,"mdbz",time) ; Simulated radar refl.
>>>>> ar_3d(time,:,:)=mdbz
>>>>> print(time)
>>>>> end do
>>>>> ncf->mdbz = ar_3d
>>>>> *********************************************************************************
>>>>>
>>>>> Version 2:
>>>>> I assumed that this array was getting too large and was choking my
>>>>> memory. My next attempt was to save the output directly to the
>>>>> netCDF file and not use an array in the middle. This script would
>>>>> crash after 234 timesteps (the netCDF file wrote 234 frames and then
>>>>> died while processing the 235th).
>>>>> *********************************************************************************
>>>>> ;compute diagnostics and export to netCDF file
>>>>> do time = 0,ntimes-1,1 ;SLP
>>>>> ncf->slp(time,:,:) = (/wrf_user_getvar(a,"slp",time)/)
>>>>> print(time)
>>>>> end do
>>>>>
>>>>> ;compute diagnostics and export to netCDF file
>>>>> do time = 0,ntimes-1,1 ;MDBZ
>>>>> ncf->mdbz(time,:,:) = (/wrf_user_getvar(a,"mdbz",time)/)
>>>>> print(time)
>>>>> end do
>>>>> *********************************************************************************
>>>>>
>>>>> Version 3:
>>>>> My last attempt was to bunch all of the diagnostics within the same
>>>>> time loop. I'm not sure how efficient this is for I/O; it didn't
>>>>> seem to take much longer, and it saved all 3 diagnostic variables to
>>>>> the netCDF file. This one stopped at 241 timesteps.
>>>>> *********************************************************************************
>>>>> ;compute diagnostics and export to netCDF file
>>>>> do time = 0,ntimes-1,1
>>>>> ncf->slp(time,:,:) = (/wrf_user_getvar(a,"slp",time)/) ;SLP
>>>>> ncf->mdbz(time,:,:) = (/wrf_user_getvar(a,"mdbz",time)/) ;MDBZ
>>>>> ncf->td2(time,:,:) = (/wrf_user_getvar(a,"td2",time)/) ;TD2
>>>>> print(time)
>>>>> end do
>>>>> *********************************************************************************
>>>>>
>>>>> If anyone has any advice on why my script might be crashing, it
>>>>> would be greatly appreciated. I'm chasing this down the path of
>>>>> memory allocation (should be a max of 4Gb per node), although I may
>>>>> be way off...
>>>>>
>>>>> CPU time : 1882.70 sec.
>>>>> Max Memory : 3256 MB
>>>>> Max Swap : 11190 MB
>>>>>
>>>>> Max Processes : 10
>>>>> Max Threads : 12
>>>>>
>>>>> Also attached is the text outfile (ncl.out) from my last analysis;
>>>>> they're all more or less the same (except for the length of the
>>>>> analysis). As shown in the outfile, I'm using NCL v5.2.0.
>>>>>
>>>>> Of course, any general critiques on how to make this NCL script more
>>>>> efficient would also be greatly appreciated.
>>>>>
>>>>> Please let me know if I neglected to mention anything important.
>>>>> Thanks!
>>>>>
>>>>> -Joe
>>>>>
>>>>>
>>>>>
>>>>> Joseph B. Zambon
>>>>> jbzambon@ncsu.edu <mailto:jbzambon@ncsu.edu>
>>>>> NC State University
>>>>> Department of Marine, Earth and Atmospheric Sciences
>>>>>
>>>>>
>>>>> <ncl_out.ncl.txt>
>>>>>
>>>>> <ncl.out>
>>>>>
>>>>> _______________________________________________
>>>>> ncl-talk mailing list
>>>>> List instructions, subscriber options, unsubscribe:
>>>>> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
>>>>
>>>
>>
>>
>

Received on Mon Jan 24 08:49:05 2011

This archive was generated by hypermail 2.1.8 : Tue Jan 25 2011 - 14:22:15 MST