Re: Script crashes while computing WRF diagnostics (SLP, MDBZ, TD2)

From: Mary Haley <haley_at_nyahnyahspammersnyahnyah>
Date: Tue Jan 25 2011 - 09:43:14 MST

Hi Gus,

Thanks for helping out with this.

Joe, I tried to download your file, but I wasn't sure which ftp site you meant for me to use.
I tried ftp.cgd.ucar.edu, but didn't see the directory or file you mentioned.

However, from what Gus states below, it sounds like you could be using a lot of memory.

If you still want me to look at the file, please tell me which ftp site to use.

Thanks,

--Mary

On Jan 24, 2011, at 3:45 PM, Gus Correa wrote:

> Hi Joe
>
> Sorry, my discussion with Mary and Dennis about a memory leak
> a year or two ago was probably offline, not on the list ...
> until Mary told me to use the list! :)
> However, it all boiled down to what Dennis identified back then and what
> I wrote here: I had an informative 'print' command at every time step
> that would eventually consume all RAM.
> Once it was commented out, the script ran fine.
> However, this may not be the problem with your script either.
>
> The item in your code that looks similar to mine is the 'print(time)'
> command, which you commented out in your latest experiment, right?
> In my NCL scripts I just removed any print commands from long loops,
> or decimated them (say, print only once every N/100 or N/1000 iterations,
> where N is the loop size, or just at the beginning, middle, and end).
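>
> For example, a minimal sketch of a decimated print (assuming 'ntimes'
> is your loop size, as in your script):
>
>   do time = 0, ntimes-1
>     ; ... compute/write the diagnostics for this frame ...
>     if (time.eq.0 .or. time.eq.ntimes-1 .or. (time % 100).eq.0) then
>       print("frame " + time)   ; report only every 100th frame
>     end if
>   end do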
>
> In any case, if I understood what you are doing correctly, you store
> the 'frames' (time frames?) in variables like ncf->slp(time,:,:).
> You say the wrfout file had 673 frames and was 313GB,
> which makes it about 0.465GB/frame.
> Then you accumulate 223 of them, after which the script goes south.
> At this point you've reached ~104GB of RAM use, if I did the math right.
> This is a lot of memory!
> Do you have a Nehalem system with 108GB of RAM,
> or an Opteron system with 128GB?
> It must have a lot of RAM, but not an infinite amount.
>
> In any case, I guess what is happening is
> what Mary has been saying all along: you may have run out of memory.
> If this is the case, you may need to handle a smaller number of frames
> at a time, or perhaps reduce their spatial resolution before you store
> them in memory, if this is an option for you (see the sketch below).
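>
> A rough sketch of chunking, borrowing the names from your script and
> assuming 'slp' is predefined in the output file (nchunk is arbitrary;
> tune it to your RAM):
>
>   nchunk = 50                                  ; frames per chunk
>   do i0 = 0, ntimes-1, nchunk
>     i1  = min((/ i0+nchunk-1, ntimes-1 /))     ; last frame of this chunk
>     buf = new((/ i1-i0+1, south_north, west_east /), float)
>     do t = i0, i1
>       buf(t-i0,:,:) = (/ wrf_user_getvar(a,"slp",t) /)
>     end do
>     ncf->slp(i0:i1,:,:) = (/ buf /)            ; write the whole slab at once
>     delete(buf)                                ; free the chunk before the next one
>   end do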
>
> Try using 'top' in a separate Linux window to monitor the
> amount of memory you are consuming as the script runs.
>
> Another two cents,
> Gus Correa
>
>
> Joseph Zambon wrote:
>> Mary,
>>
>> I put some test files in /incoming/jbzambon...
>> jbzambon.nc:
>> The wrfout I was working with had 673 frames (313GB), so I used ncks to reduce it to 2 frames and got the file to under a GB.
>> ncks wrfout_d01_2010-02-01_00:00:00 -d Time,1,2 jbzambon.nc
>>
>> jbzambon.ncl:
>> As the wrfout now has only 2 frames, I modified my script a little, but I am able to reproduce the issue by simply looping over the first frame repeatedly. The script ran to frame 223 before dying. I commented out the original loop and replaced it with something that simply runs the diagnostics on the first frame of data 673 times. I attached a diff in case this is confusing...
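>>
>> In essence, the replacement loop does something like this (the attached
>> diff has the exact change):
>>
>>   do time = 0,672,1
>>     ncf->slp(time,:,:)  = (/wrf_user_getvar(a,"slp",0)/)   ; frame 0 every time
>>     ncf->mdbz(time,:,:) = (/wrf_user_getvar(a,"mdbz",0)/)
>>     ncf->td2(time,:,:)  = (/wrf_user_getvar(a,"td2",0)/)
>>   end do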
>>
>> There is another file, ncl_out_v4.ncl. I uploaded it before I had finished making changes and testing it, and I don't have permission to replace or delete it. Please ignore/delete this file.
>>
>>
>> Thank you for providing me with the location of wrf_user.f. I will take a look at that routine to see if I can compile/run it outside of NCL and see whether there is a memory issue with that routine.
>>
>> [jbzambon@login03 OUT]$ ncl -V
>> 5.2.0
>> [jbzambon@login03 OUT]$ uname -a
>> Linux login03 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Thanks again!
>>
>> -Joe
>>
>>
>> Gus,
>>
>> I was unable to find anything in the code that was printing messages and consuming memory... I was also unable to find a discussion of that in the ncl-talk archives. If you have more information about that, it would be greatly appreciated.
>>
>> Thanks for your $0.02!
>>
>> -Joe
>>
>>
>>
>> On Jan 24, 2011, at 10:48 AM, Mary Haley wrote:
>>
>>> On Jan 21, 2011, at 4:51 PM, Gus Correa wrote:
>>>
>>>> Hi Mary and Joe (and Dennis, if he's around)
>>>>
>>>> Some time ago I suffered from a similar memory failure/leak,
>>>> where my NCL scripts would progressively slow down,
>>>> and take up all memory.
>>>> This was with NCL 5.1.1.
>>>>
>>>> Back then the problem was traced to printing messages
>>>> from inside a loop.
>>>> Dennis pointed out that memory allocated to hold the message strings
>>>> was not being freed by NCL (if I remember right).
>>>> Mary and Dennis may remember this.
>>>>
>>>> Once the print commands were commented out, the problem disappeared.
>>>>
>>>> Could this be Joe's problem?
>>> It's certainly possible. It depends on the nature of the strings being used, and how
>>> many of them there are. This situation was improved a little, but there is still
>>> an issue.
>>>
>>> If you are doing a lot of looping, and printing out strings inside the loop, this
>>> can quickly add up to a problem.
>>>
>>> --Mary
>>>
>>>> My two cents,
>>>> Gus Correa
>>>>
>>>> Mary Haley wrote:
>>>>> Hi Joe,
>>>>>
>>>>> What version of NCL are you running?
>>>>>
>>>>> There could be some internal memory being grabbed that you won't see via
>>>>> list_vars. As you alluded to below,
>>>>> it is likely one of the wrf_xxx functions, as they have to internally
>>>>> create a copy of your input arrays to convert
>>>>> them to double precision (since they come in as float). Also, in order
>>>>> to calculate "slp", you have to grab roughly
>>>>> 6 variables off the file, all of which are multi-dimensional arrays of
>>>>> size 30 x 515 x 380. This may start to
>>>>> add up considering you have to convert some of these to double precision.
>>>>>
>>>>> However, these arrays should be getting freed up after you calculate
>>>>> each quantity.
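>>>>>
>>>>> For reference, the "slp" computation under the hood looks roughly like
>>>>> this (paraphrased from WRFUserARW.ncl, with 'a' as your file handle);
>>>>> note the six 3D reads, most of which also get copied to double:
>>>>>
>>>>>   T      = a->T(time,:,:,:)        ; perturbation theta
>>>>>   P      = a->P(time,:,:,:)
>>>>>   PB     = a->PB(time,:,:,:)
>>>>>   QVAPOR = a->QVAPOR(time,:,:,:)
>>>>>   PH     = a->PH(time,:,:,:)       ; staggered geopotential
>>>>>   PHB    = a->PHB(time,:,:,:)
>>>>>   P      = P + PB                  ; full pressure (Pa)
>>>>>   PH     = (PH + PHB)/9.81
>>>>>   z      = wrf_user_unstagger(PH, PH@stagger)   ; height (m)
>>>>>   tk     = wrf_tk(P, T + 300.)     ; temperature (K)
>>>>>   slp    = wrf_slp(z, tk, P, QVAPOR)            ; SLP (hPa)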
>>>>>
>>>>> Would you be able to provide us with your latest script and data files
>>>>> so we can try running it here? You can use our ftp
>>>>> site (see http://www.ncl.ucar.edu/report_bug.shtml for ftp info), or put
>>>>> the files on your own site and we'll get them.
>>>>>
>>>>> The actual WRF computations are done in various Fortran files in
>>>>> $NCARG/ni/src/lib/nfpfort. The wrf_slp one is
>>>>> in wrf_user.f. I've attached the file here.
>>>>>
>>>>>
>>>>> --Mary
>>>>>
>>>>> <wrf_user.f>
>>>>>
>>>>> On Jan 21, 2011, at 3:00 PM, Joseph Zambon wrote:
>>>>>
>>>>>> Mary,
>>>>>>
>>>>>> I'm not sure how NCL reserves memory, but I think there may be a
>>>>>> "leak" somewhere while the loop is iterating.
>>>>>>
>>>>>> When NCL starts, it reserves 5936KB of physical memory and 107MB of
>>>>>> virtual memory...
>>>>>> 6283 jbzambon 16 0 107m 5936 4060 S 0.0 0.1 0:00.02 ncl
>>>>>>
>>>>>> When I run the script to just before the loop, it's using
>>>>>> 170MB/69MB...
>>>>>> 6283 jbzambon 15 0 170m 69m 5532 S 0.0 1.8 0:00.51 ncl
>>>>>>
>>>>>> When I execute the loop for 5 iterations, it's using 551MB/450MB...
>>>>>> 6283 jbzambon 25 0 551m 450m 5840 S 0.0 11.4 0:58.14 ncl
>>>>>>
>>>>>> When I run a list_vars command, I get the following variables...
>>>>>> ncl 52> list_vars
>>>>>> string in_file [ 1 ]
>>>>>> integer time [ 1 ]
>>>>>> string var_names_time [ 1 ]
>>>>>> integer south_north [ 1 ]
>>>>>> string maxtime [ 673 ]
>>>>>> _FillValue
>>>>>> description
>>>>>> integer DateStrLen [ 1 ]
>>>>>> integer bottom_top [ 1 ]
>>>>>> string dim_names [ 5 ]
>>>>>> integer west_east [ 1 ]
>>>>>> string dim_names_time [ 2 ]
>>>>>> logical dimUnlim [ 5 ]
>>>>>> integer b [ 3 ]
>>>>>> float T [ bottom_top | 30 ] x [ south_north | 515 ] x [ west_east | 380 ]
>>>>>> FieldType
>>>>>> MemoryOrder
>>>>>> description
>>>>>> units
>>>>>> stagger
>>>>>> coordinates
>>>>>> string dim_names_3D [ 3 ]
>>>>>> string out_file [ 1 ]
>>>>>> string var_types_3D [ 10 ]
>>>>>> string var_types_time [ 1 ]
>>>>>> string var_names_3D [ 10 ]
>>>>>> integer dim_sizes [ 5 ]
>>>>>> integer ntimes [ 1 ]
>>>>>>
>>>>>> The only substantial source of memory listed is T (30x515x380)...
>>>>>> 30 x 515 x 380 = 5,871,000 floating-point values * 32 bits/float =
>>>>>> 187,872,000 bits / 8,388,608 bits/MB = 22.39MB
>>>>>>
>>>>>> When I delete that float array, I get the following memory usage;
>>>>>> it's using 528MB/428MB...
>>>>>> 6283 jbzambon 15 0 528m 428m 5844 S 0.0 10.8 0:58.17 ncl
>>>>>> 551MB-528MB=23MB / 450MB-428MB=22MB, so this makes sense...
>>>>>>
>>>>>> Somewhere in the iteration of the loop, though, memory is being
>>>>>> reserved that is not reported by list_vars, so I can't delete
>>>>>> anything in order to reduce memory usage.
>>>>>>
>>>>>> I've tried using the last version of the code, as well as your
>>>>>> suggestion to use -1 instead of running a loop. I replaced my loop,
>>>>>> as you suggested, with
>>>>>> ncf->slp = wrf_user_getvar(a,"slp",-1) ;SLP
>>>>>> When I monitor memory usage while using your suggestion, the amount
>>>>>> of reserved memory fluctuates between 1.6GB and 2.8GB...
>>>>>> 6351 jbzambon 18 0 2954m 714m 5496 R 50.0 18.1 0:04.54 ncl
>>>>>> 6351 jbzambon 18 0 2954m 1.6g 5496 D 51.5 41.0 0:13.29 ncl
>>>>>> 6351 jbzambon 18 0 2954m 1.9g 5496 D 61.8 49.9 0:16.73 ncl
>>>>>> 6351 jbzambon 18 0 2954m 2.3g 5496 R 63.3 59.5 0:20.48 ncl
>>>>>> 6351 jbzambon 18 0 2954m 2.5g 5496 R 67.7 63.8 0:22.52 ncl
>>>>>> 6351 jbzambon 18 0 2954m 2.6g 5496 D 63.4 68.3 0:24.43 ncl
>>>>>> 6351 jbzambon 17 0 2954m 2.2g 6760 D 10.3 58.0 0:30.15 ncl
>>>>>> 6351 jbzambon 16 0 2954m 2.8g 6760 D 9.0 71.8 0:33.23 ncl
>>>>>>
>>>>>> After about 35s of running, I get a segmentation fault....
>>>>>> Segmentation fault (core dumped)
>>>>>>
>>>>>> If I try to open the netCDF file to see at which frame it died, I am unable...
>>>>>> ncdump -h /he_data/he/jbzambon/projects/feb10/OUT/ncl_out.nc
>>>>>> ncdump: /he_data/he/jbzambon/projects/feb10/OUT/ncl_out.nc: Not a
>>>>>> netCDF file
>>>>>>
>>>>>> I haven't worked on predefining the netCDF file yet; I'm guessing
>>>>>> that has to do with I/O efficiency and wouldn't be the cause of this issue.
>>>>>>
>>>>>> Also, where are the functions which WRFUserARW.ncl calls located? For
>>>>>> example:
>>>>>> tk = wrf_tk( P , T ) ; calculate TK
>>>>>> slp = wrf_slp( z, tk, P, QVAPOR ) ; calculate slp
>>>>>> I'm guessing it's somewhere in the NCL source, as it's not in:
>>>>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/gsn_code.ncl"
>>>>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/wrf/WRFUserARW.ncl"
>>>>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/contributed.ncl"
>>>>>> And I can't see anything in wrfW.c that does the actual computation.
>>>>>>
>>>>>> If you have any other ideas, please let me know. Thanks for your
>>>>>> suggestions and help!
>>>>>>
>>>>>> -Joe
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Jan 21, 2011, at 12:57 PM, Mary Haley wrote:
>>>>>>
>>>>>>> Hi Joseph,
>>>>>>>
>>>>>>> I'm guessing you are simply running out of memory. Once you are done
>>>>>>> with a do loop,
>>>>>>> you might try deleting any variables you don't need before you
>>>>>>> proceed to the next do loop.
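>>>>>>>
>>>>>>> For example, a sketch based on your first version:
>>>>>>>
>>>>>>>   do time = 0,ntimes-1,1
>>>>>>>     slp = wrf_user_getvar(a,"slp",time)
>>>>>>>     ar_3d(time,:,:) = (/slp/)
>>>>>>>   end do
>>>>>>>   ncf->slp = ar_3d
>>>>>>>   delete(slp)   ; done with slp before starting the MDBZ loop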
>>>>>>>
>>>>>>> Also, have you tried getting rid of the "do" loops and using "-1" as
>>>>>>> the last argument to
>>>>>>> wrf_user_getvar? This allows you to read in all time steps for that
>>>>>>> variable without
>>>>>>> having to loop.
>>>>>>>
>>>>>>> What I'm thinking about may look like your third attempt, except
>>>>>>> without the do loop:
>>>>>>>
>>>>>>>> ncf->slp = wrf_user_getvar(a,"slp",-1) ;SLP
>>>>>>>> ncf->mdbz = wrf_user_getvar(a,"mdbz",-1) ;MDBZ
>>>>>>>> ncf->td2 = wrf_user_getvar(a,"td2",-1) ;TD2
>>>>>>> If you predefine the NetCDF file that you're writing to, your code
>>>>>>> will be far more efficient, especially if you are writing a lot of
>>>>>>> data to the file.
>>>>>>>
>>>>>>> Please see our examples page on this:
>>>>>>>
>>>>>>> http://www.ncl.ucar.edu/Applications/method_2.shtml
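>>>>>>>
>>>>>>> A minimal sketch of the predefine approach, using the dimension names
>>>>>>> and the south_north/west_east sizes from your script (the local
>>>>>>> variable names here are just illustrative):
>>>>>>>
>>>>>>>   ncf = addfile(out_file,"c")
>>>>>>>   setfileoption(ncf,"DefineMode",True)
>>>>>>>   dim_names = (/ "Time", "south_north", "west_east" /)
>>>>>>>   filedimdef(ncf, dim_names, (/ -1, south_north, west_east /), \
>>>>>>>              (/ True, False, False /))   ; Time is unlimited
>>>>>>>   filevardef(ncf, "slp",  "float", dim_names)
>>>>>>>   filevardef(ncf, "mdbz", "float", dim_names)
>>>>>>>   filevardef(ncf, "td2",  "float", dim_names)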
>>>>>>>
>>>>>>> Let me know if this doesn't help and we'll try something else.
>>>>>>>
>>>>>>> --Mary
>>>>>>>
>>>>>>> On Jan 20, 2011, at 2:35 PM, Joseph Zambon wrote:
>>>>>>>
>>>>>>>> NCL Users,
>>>>>>>>
>>>>>>>> I have a script that runs through a wrfout file and pulls out
>>>>>>>> diagnostics as well as some variables I need to use. The variables
>>>>>>>> are extracted without a problem; however, when I run this script for
>>>>>>>> larger domains, the diagnostics appear to crash the system. I've
>>>>>>>> been running this on a serial compute node, as I've crashed the login
>>>>>>>> node a few times (to the aggravation of other users, I'm sure) while
>>>>>>>> trying to debug this.
>>>>>>>>
>>>>>>>> Attached is my script as it is right now (ncl_out.ncl). I've tried
>>>>>>>> a couple of different approaches to running the diagnostics for a
>>>>>>>> domain that is 379 (w_e) x 514 (s_n) with 673 timesteps. Using the
>>>>>>>> diagnostics SLP and MDBZ as examples...
>>>>>>>>
>>>>>>>> Version 1:
>>>>>>>> My original approach was to create an array (ar_3d), save SLP to the
>>>>>>>> array at every time step, save the array to the netCDF file, then
>>>>>>>> overwrite the array for the next variable (MDBZ). This way I was only
>>>>>>>> using one large array (379x514x673) to pull out the variables. This
>>>>>>>> script would crash after ~120 timesteps (not sure exactly where, as it
>>>>>>>> doesn't save the array until the end and would not print out the times
>>>>>>>> right before dying). This was probably the least efficient use of memory.
>>>>>>>> *********************************************************************************
>>>>>>>> ;create diagnostics array
>>>>>>>> ar_3d = new( (/ ntimes, south_north, west_east /), float)
>>>>>>>>
>>>>>>>> ;Diagnostics
>>>>>>>> do time = 0,ntimes-1,1 ;SLP
>>>>>>>> slp = wrf_user_getvar(a,"slp",time) ; Sea level pressure in hPa
>>>>>>>> ar_3d(time,:,:)=slp
>>>>>>>> print(time)
>>>>>>>> end do
>>>>>>>> ncf->slp = ar_3d
>>>>>>>>
>>>>>>>> do time = 0,ntimes-1,1 ;MDBZ
>>>>>>>> mdbz = wrf_user_getvar(a,"mdbz",time) ; Simulated radar refl.
>>>>>>>> ar_3d(time,:,:)=mdbz
>>>>>>>> print(time)
>>>>>>>> end do
>>>>>>>> ncf->mdbz = ar_3d
>>>>>>>> *********************************************************************************
>>>>>>>>
>>>>>>>> Version 2:
>>>>>>>> I assumed that this array was getting too large and was choking my
>>>>>>>> memory. My next attempt was to save the output directly to the
>>>>>>>> netCDF file without an intermediate array. This script would
>>>>>>>> crash after 234 timesteps (the netCDF file contained 234 frames, and
>>>>>>>> the script died while processing the 235th).
>>>>>>>> *********************************************************************************
>>>>>>>> ;compute diagnostics and export to netCDF file
>>>>>>>> do time = 0,ntimes-1,1 ;SLP
>>>>>>>> ncf->slp(time,:,:) = (/wrf_user_getvar(a,"slp",time)/)
>>>>>>>> print(time)
>>>>>>>> end do
>>>>>>>>
>>>>>>>> ;compute diagnostics and export to netCDF file
>>>>>>>> do time = 0,ntimes-1,1 ;MDBZ
>>>>>>>> ncf->mdbz(time,:,:) = (/wrf_user_getvar(a,"mdbz",time)/)
>>>>>>>> print(time)
>>>>>>>> end do
>>>>>>>> *********************************************************************************
>>>>>>>>
>>>>>>>> Version 3:
>>>>>>>> My last attempt was to bunch all of the diagnostics within the same
>>>>>>>> time loop. I'm not sure how efficient this is for I/O, but it didn't
>>>>>>>> seem to take much longer, and it saved all 3 diagnostic variables to
>>>>>>>> the netCDF file. This one stopped at 241 timesteps.
>>>>>>>> *********************************************************************************
>>>>>>>> ;compute diagnostics and export to netCDF file
>>>>>>>> do time = 0,ntimes-1,1
>>>>>>>> ncf->slp(time,:,:) = (/wrf_user_getvar(a,"slp",time)/) ;SLP
>>>>>>>> ncf->mdbz(time,:,:) = (/wrf_user_getvar(a,"mdbz",time)/) ;MDBZ
>>>>>>>> ncf->td2(time,:,:) = (/wrf_user_getvar(a,"td2",time)/) ;TD2
>>>>>>>> print(time)
>>>>>>>> end do
>>>>>>>> *********************************************************************************
>>>>>>>>
>>>>>>>> If anyone has any advice on why my script might be crashing, it
>>>>>>>> would be greatly appreciated. I'm chasing this down the path of memory
>>>>>>>> allocation (should be a max of 4GB per node), although I may be way
>>>>>>>> off...
>>>>>>>>
>>>>>>>> CPU time : 1882.70 sec.
>>>>>>>> Max Memory : 3256 MB
>>>>>>>> Max Swap : 11190 MB
>>>>>>>>
>>>>>>>> Max Processes : 10
>>>>>>>> Max Threads : 12
>>>>>>>>
>>>>>>>> Also attached is the text outfile (ncl.out) from my last analysis;
>>>>>>>> they're all more or less the same (except for the length of the analysis).
>>>>>>>> As shown in the outfile, I'm using NCL v. 5.2.0.
>>>>>>>>
>>>>>>>> Of course, any general critiques on how to make this NCL script more
>>>>>>>> efficient would also be greatly appreciated.
>>>>>>>>
>>>>>>>> Please let me know if I neglected to mention anything important.
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> -Joe
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Joseph B. Zambon
>>>>>>>> jbzambon@ncsu.edu
>>>>>>>> NC State University
>>>>>>>> Department of Marine, Earth and Atmospheric Sciences
>>>>>>>>
>>>>>>>>
>>>>>>>> <ncl_out.ncl.txt>
>>>>>>>>
>>>>>>>> <ncl.out>
>>>>>>>>
>>
>

_______________________________________________
ncl-talk mailing list
List instructions, subscriber options, unsubscribe:
http://mailman.ucar.edu/mailman/listinfo/ncl-talk
Received on Tue Jan 25 09:43:21 2011

This archive was generated by hypermail 2.1.8 : Tue Jan 25 2011 - 14:22:15 MST