Re: Script crashes while computing WRF diagnostics (SLP, MDBZ, TD2)

From: Mary Haley <haley_at_nyahnyahspammersnyahnyah>
Date: Tue Jan 25 2011 - 14:19:37 MST

Joe,

I tried your script using v5.2.1 on my Mac, and it was still running after 300 iterations.

I then killed the script and modified it so that there is no looping:

ncf->slp = (/wrf_user_getvar(a,"slp",-1)/)
ncf->mdbz = (/wrf_user_getvar(a,"mdbz",-1)/)
ncf->td2 = (/wrf_user_getvar(a,"td2",-1)/)
ncf->lat = a->XLAT
ncf->lon = a->XLONG
ncf->landmask = a->LANDMASK
ncf->t2 = a->T2
ncf->u10 = a->U10
ncf->v10 = a->V10
ncf->sst = a->SST
ncf->Times = a->Times

This ran in about 10 wall clock seconds with no errors. Can you try the same thing on your end and see if it works?

It is much better to do this without looping, because looping can slow your code down considerably.

--Mary

On Jan 24, 2011, at 1:35 PM, Joseph Zambon wrote:

> Mary,
>
> I put some test files in /incoming/jbzambon...
> jbzambon.nc:
> The wrfout I was working with had 673 frames (313 GB), so I used ncks to reduce it to 2 frames and got the file to under a GB.
> ncks wrfout_d01_2010-02-01_00:00:00 -d Time,1,2 jbzambon.nc
>
> jbzambon.ncl:
> As the wrfout has only 2 frames, I modified my script a little, but I am able to reproduce the issue by simply looping over the first frame repeatedly. I commented out the original loop and replaced it with something that simply runs the diagnostics on the first frame of data 673 times; the script ran to 223 before dying. I attached a diff in case this is confusing...
>
> There is another file, ncl_out_v4.ncl. I uploaded it before making enough changes and testing, and I do not have permission to replace or delete it. Please ignore/delete this file.
>
> <diff.log>
>
> Thank you for providing the location of wrf_user.f. I will take a look at that routine to see if I can compile/run it outside of NCL and check whether there is a memory issue in that routine.
>
> [jbzambon@login03 OUT]$ ncl -V
> 5.2.0
> [jbzambon@login03 OUT]$ uname -a
> Linux login03 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
>
> Thanks again!
>
> -Joe
>
>
> Gus,
>
> I was unable to find anything in the code that was printing messages and taking memory... I was also unable to find a discussion of that in the ncl-talk archives. If you have more information about that, it would be greatly appreciated.
>
> Thanks for your $0.02!
>
> -Joe
>
>
>
> On Jan 24, 2011, at 10:48 AM, Mary Haley wrote:
>
>>
>> On Jan 21, 2011, at 4:51 PM, Gus Correa wrote:
>>
>>> Hi Mary and Joe (and Dennis, if he's around)
>>>
>>> Some time ago I suffered from a similar memory failure/leak,
>>> where my NCL scripts would progressively slow down,
>>> and take up all memory.
>>> This was with NCL 5.1.1.
>>>
>>> Back then the problem was traced to printing messages
>>> from inside a loop.
>>> Dennis pointed out that memory allocated to hold the message strings
>>> was not being freed by NCL (if I remember right).
>>> Mary and Dennis may remember this.
>>>
>>> Once the print commands were commented out, the problem disappeared.
>>>
>>> Could this be Joe's problem?
>>
>> It's certainly possible. It depends on the nature of the strings being used, and how
>> many of them there are. This situation was improved a little, but there is still
>> an issue.
>>
>> If you are doing a lot of looping, and printing out strings inside the loop, this
>> can quickly add up to a problem.
>>
>> --Mary
>>
>>>
>>> My two cents,
>>> Gus Correa
>>>
>>> Mary Haley wrote:
>>>> Hi Joe,
>>>>
>>>> What version of NCL are you running?
>>>>
>>>> There could be some internal memory being grabbed that you won't see via
>>>> list_vars. As you alluded to below,
>>>> it is likely one of the wrf_xxx functions, as they have to internally
>>>> create a copy of your input arrays to convert
>>>> them to double precision (since they come in as float). Also, in order
>>>> to calculate "slp", you have to grab roughly
>>>> 6 variables off the file, all of which are multi-dimensional arrays of
>>>> size 30 x 515 x 380. This may start to
>>>> add up considering you have to convert some of these to double precision.
>>>>
>>>> However, these arrays should be getting freed up after you calculate
>>>> each quantity.
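Mary's estimate above can be roughed out with a little arithmetic (a Python sketch, illustrative only; the 8 bytes/value assumes the internal double-precision copies she describes, and 6 fields is her rough count):

```python
# Rough working set for one "slp" calculation: ~6 fields of 30 x 515 x 380
# values, each copied internally to double precision (8 bytes per value).
n_fields = 6
nz, ny, nx = 30, 515, 380
total_bytes = n_fields * nz * ny * nx * 8
print(round(total_bytes / (1024 * 1024), 1))  # ~268.8 MB per computation
```

That is a substantial chunk per call, though as Mary notes it should be freed after each quantity is computed.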
>>>>
>>>> Would you be able to provide us with your latest script and data files
>>>> so we can try running it here? You can use our ftp
>>>> site (see http://www.ncl.ucar.edu/report_bug.shtml for ftp info), or put
>>>> the files on your own site and we'll get them.
>>>>
>>>> The actual WRF computations are done in various Fortran files in
>>>> $NCARG/ni/src/lib/nfpfort. The wrf_slp one is
>>>> in wrf_user.f. I've attached the file here.
>>>>
>>>>
>>>> --Mary
>>>>
>>>>
>>>> On Jan 21, 2011, at 3:00 PM, Joseph Zambon wrote:
>>>>
>>>>> Mary,
>>>>>
>>>>> I'm not sure how NCL reserves memory, but I think there may be a
>>>>> "leak" somewhere while the loop is iterating.
>>>>>
>>>>> When NCL starts, it reserves 5936kb of physical memory, 170mb of
>>>>> virtual memory...
>>>>> 6283 jbzambon 16 0 107m 5936 4060 S 0.0 0.1 0:00.02 ncl
>>>>>
>>>>> When I run the script to just before running the loop, it's using
>>>>> 69mb/551mb...
>>>>> 6283 jbzambon 15 0 170m 69m 5532 S 0.0 1.8 0:00.51 ncl
>>>>>
>>>>> When I execute the loop for 5 iterations, it's using 551mb/450mb...
>>>>> 6283 jbzambon 25 0 551m 450m 5840 S 0.0 11.4 0:58.14 ncl
>>>>>
>>>>> When I run a list_vars command, I get the following variables...
>>>>> ncl 52> list_vars
>>>>> string in_file [ 1 ]
>>>>> integer time [ 1 ]
>>>>> string var_names_time [ 1 ]
>>>>> integer south_north [ 1 ]
>>>>> string maxtime [ 673 ]
>>>>> _FillValue
>>>>> description
>>>>> integer DateStrLen [ 1 ]
>>>>> integer bottom_top [ 1 ]
>>>>> string dim_names [ 5 ]
>>>>> integer west_east [ 1 ]
>>>>> string dim_names_time [ 2 ]
>>>>> logical dimUnlim [ 5 ]
>>>>> integer b [ 3 ]
>>>>> float T [ bottom_top | 30 ] x [ south_north | 515 ] x [ west_east | 380 ]
>>>>> FieldType
>>>>> MemoryOrder
>>>>> description
>>>>> units
>>>>> stagger
>>>>> coordinates
>>>>> string dim_names_3D [ 3 ]
>>>>> string out_file [ 1 ]
>>>>> string var_types_3D [ 10 ]
>>>>> string var_types_time [ 1 ]
>>>>> string var_names_3D [ 10 ]
>>>>> integer dim_sizes [ 5 ]
>>>>> integer ntimes [ 1 ]
>>>>>
>>>>> The only substantial source of memory listed is T (30x515x380)...
>>>>> 30 x 515 x 380 = 5,871,000 floats * 4 bytes/float = 23,484,000 bytes
>>>>> / 1,048,576 bytes/MB = 22.4 MB
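Joe's byte-counting above can be double-checked with a few lines (a Python sketch, illustrative only; 4 bytes per single-precision float):

```python
# Footprint of the T array: 30 x 515 x 380 single-precision (4-byte) floats.
nz, ny, nx = 30, 515, 380
n_values = nz * ny * nx                  # 5,871,000 values
total_bytes = n_values * 4               # 23,484,000 bytes
print(n_values, round(total_bytes / (1024 * 1024), 2))  # 5871000 22.4
```

which matches the ~22-23 MB drop he observes below when the variable is deleted.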
>>>>>
>>>>> When I delete that float, I get the following memory usage; it's using
>>>>> 528mb/428mb...
>>>>> 6283 jbzambon 15 0 528m 428m 5844 S 0.0 10.8 0:58.17 ncl
>>>>> 551mb-528mb=23mb / 450mb-428mb=22mb, so this makes sense...
>>>>>
>>>>> Somewhere in the iteration of the loop though, memory is being
>>>>> reserved that is not being reported by list_vars, so I can't delete
>>>>> anything in order to reduce memory usage.
>>>>>
>>>>> I've tried using the last version of the code, as well as your
>>>>> suggestion to use -1 instead of running a loop. I replaced my loop,
>>>>> as you suggested, with:
>>>>> ncf->slp = wrf_user_getvar(a,"slp",-1) ;SLP
>>>>> When I monitor memory usage while using your suggestion, the amount
>>>>> of reserved memory fluctuates between 1.6 GB and 2.8 GB...
>>>>> 6351 jbzambon 18 0 2954m 714m 5496 R 50.0 18.1 0:04.54 ncl
>>>>> 6351 jbzambon 18 0 2954m 1.6g 5496 D 51.5 41.0 0:13.29 ncl
>>>>> 6351 jbzambon 18 0 2954m 1.9g 5496 D 61.8 49.9 0:16.73 ncl
>>>>> 6351 jbzambon 18 0 2954m 2.3g 5496 R 63.3 59.5 0:20.48 ncl
>>>>> 6351 jbzambon 18 0 2954m 2.5g 5496 R 67.7 63.8 0:22.52 ncl
>>>>> 6351 jbzambon 18 0 2954m 2.6g 5496 D 63.4 68.3 0:24.43 ncl
>>>>> 6351 jbzambon 17 0 2954m 2.2g 6760 D 10.3 58.0 0:30.15 ncl
>>>>> 6351 jbzambon 16 0 2954m 2.8g 6760 D 9.0 71.8 0:33.23 ncl
>>>>>
>>>>> After about 35s of running, I get a segmentation fault....
>>>>> Segmentation fault (core dumped)
>>>>>
>>>>> If I try to open the netCDF file to see at what frame it died, I am unable...
>>>>> ncdump -h /he_data/he/jbzambon/projects/feb10/OUT/ncl_out.nc
>>>>> ncdump: /he_data/he/jbzambon/projects/feb10/OUT/ncl_out.nc: Not a
>>>>> netCDF file
>>>>>
>>>>> I haven't yet worked on predefining the netCDF file; I'm guessing
>>>>> that has to do with I/O efficiency and wouldn't be a cause of this issue.
>>>>>
>>>>> Also, where are the functions which WRFUserARW.ncl calls located? For
>>>>> example:
>>>>> tk = wrf_tk( P , T ) ; calculate TK
>>>>> slp = wrf_slp( z, tk, P, QVAPOR ) ; calculate slp
>>>>> I'm guessing it's somewhere in the NCL source, as it's not in:
>>>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/gsn_code.ncl"
>>>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/wrf/WRFUserARW.ncl"
>>>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/contributed.ncl"
>>>>> And I can't see anything in wrfW.c that does the actual computation.
>>>>>
>>>>> If you have any other ideas, please let me know. Thanks for your
>>>>> suggestions and help!
>>>>>
>>>>> -Joe
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Jan 21, 2011, at 12:57 PM, Mary Haley wrote:
>>>>>
>>>>>> Hi Joseph,
>>>>>>
>>>>>> I'm guessing you are simply running out of memory. Once you are done
>>>>>> with a do loop,
>>>>>> you might try deleting any variables you don't need before you
>>>>>> proceed to the next do loop.
>>>>>>
>>>>>> Also, have you tried getting rid of the "do" loops and using "-1" as
>>>>>> the last argument to
>>>>>> wrf_user_getvar? This allows you to read in all time steps for that
>>>>>> variable without
>>>>>> having to loop.
>>>>>>
>>>>>> What I'm thinking about may look like your third attempt, except
>>>>>> without the do loop:
>>>>>>
>>>>>>> ncf->slp = wrf_user_getvar(a,"slp",-1) ;SLP
>>>>>>> ncf->mdbz = wrf_user_getvar(a,"mdbz",-1) ;MDBZ
>>>>>>> ncf->td2 = wrf_user_getvar(a,"td2",-1) ;TD2
>>>>>>
>>>>>> If you predefine the NetCDF file that you're writing to, this will
>>>>>> help your code be far more efficient,
>>>>>> especially if you are writing a lot of data to the file.
>>>>>>
>>>>>> Please see our examples page on this:
>>>>>>
>>>>>> http://www.ncl.ucar.edu/Applications/method_2.shtml
>>>>>>
>>>>>> Let me know if this doesn't help and we'll try something else.
>>>>>>
>>>>>> --Mary
>>>>>>
>>>>>> On Jan 20, 2011, at 2:35 PM, Joseph Zambon wrote:
>>>>>>
>>>>>>> NCL Users,
>>>>>>>
>>>>>>> I have a script that runs through a wrfout file and pulls out
>>>>>>> diagnostics as well as some variables I need to use. The variables
>>>>>>> are extracted without a problem; however, when I run this script for
>>>>>>> larger domains, the diagnostics appear to crash the system. I've
>>>>>>> been running this on a serial compute node, as I've crashed the login
>>>>>>> node a few times (to the aggravation of other users, I'm sure) while
>>>>>>> trying to debug this.
>>>>>>>
>>>>>>> Attached is my script as it is right now (ncl_out.ncl). I've tried
>>>>>>> a couple of different approaches to running the diagnostics for a
>>>>>>> domain that is 379 (w_e) x 514 (s_n) and 673 timesteps. Using the
>>>>>>> diagnostics SLP and MDBZ as examples...
>>>>>>>
>>>>>>> Version 1:
>>>>>>> My original approach was to create an array (ar_3d), save SLP to the
>>>>>>> array at every time step, save the array to the netCDF file, then
>>>>>>> overwrite the array for the next variable (MDBZ). This way I was only
>>>>>>> using one large array (379x514x673) to pull out the variables. This
>>>>>>> script would crash after ~120 timesteps (not sure exactly where, as it
>>>>>>> doesn't save the array until the end and would not print out the times
>>>>>>> right before dying). This was probably the least efficient use of memory.
>>>>>>> *********************************************************************************
>>>>>>> ;create diagnostics array
>>>>>>> ar_3d = new( (/ ntimes, south_north, west_east /), float)
>>>>>>>
>>>>>>> ;Diagnostics
>>>>>>> do time = 0,ntimes-1,1 ;SLP
>>>>>>> slp = wrf_user_getvar(a,"slp",time) ; Sea level pressure in hPa
>>>>>>> ar_3d(time,:,:)=slp
>>>>>>> print(time)
>>>>>>> end do
>>>>>>> ncf->slp = ar_3d
>>>>>>>
>>>>>>> do time = 0,ntimes-1,1 ;MDBZ
>>>>>>> mdbz = wrf_user_getvar(a,"mdbz",time) ; Simulated radar refl.
>>>>>>> ar_3d(time,:,:)=mdbz
>>>>>>> print(time)
>>>>>>> end do
>>>>>>> ncf->mdbz = ar_3d
>>>>>>> *********************************************************************************
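For scale, the intermediate ar_3d array in Version 1 is itself sizable (a Python sketch of the arithmetic, illustrative only; single-precision floats at 4 bytes each):

```python
# Footprint of ar_3d: 673 timesteps of a 514 x 379 field, 4-byte floats.
nt, ny, nx = 673, 514, 379
total_bytes = nt * ny * nx * 4
print(round(total_bytes / (1024 * 1024), 1))  # ~500.1 MB
```

so a single copy fits comfortably in 4 GB, which points the suspicion at memory accumulating inside the loop rather than at the array itself.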
>>>>>>>
>>>>>>> Version 2:
>>>>>>> I assumed that this array was getting too large and was choking my
>>>>>>> memory. My next attempt was to save the output directly to the
>>>>>>> netCDF file and not use an array in the middle. This script would
>>>>>>> crash after 234 timesteps (the netCDF file wrote 234 frames and then
>>>>>>> died while processing the 235th).
>>>>>>> *********************************************************************************
>>>>>>> ;compute diagnostics and export to netCDF file
>>>>>>> do time = 0,ntimes-1,1 ;SLP
>>>>>>> ncf->slp(time,:,:) = (/wrf_user_getvar(a,"slp",time)/)
>>>>>>> print(time)
>>>>>>> end do
>>>>>>>
>>>>>>> ;compute diagnostics and export to netCDF file
>>>>>>> do time = 0,ntimes-1,1 ;MDBZ
>>>>>>> ncf->mdbz(time,:,:) = (/wrf_user_getvar(a,"mdbz",time)/)
>>>>>>> print(time)
>>>>>>> end do
>>>>>>> *********************************************************************************
>>>>>>>
>>>>>>> Version 3:
>>>>>>> My last attempt was to bunch all of the diagnostics within the same
>>>>>>> time loop. I'm not sure how efficient this is for I/O; it didn't
>>>>>>> seem to take much longer, and it saved all 3 diagnostic variables to
>>>>>>> the netCDF file. This one stopped at 241 timesteps.
>>>>>>> *********************************************************************************
>>>>>>> ;compute diagnostics and export to netCDF file
>>>>>>> do time = 0,ntimes-1,1
>>>>>>> ncf->slp(time,:,:) = (/wrf_user_getvar(a,"slp",time)/) ;SLP
>>>>>>> ncf->mdbz(time,:,:) = (/wrf_user_getvar(a,"mdbz",time)/) ;MDBZ
>>>>>>> ncf->td2(time,:,:) = (/wrf_user_getvar(a,"td2",time)/) ;TD2
>>>>>>> print(time)
>>>>>>> end do
>>>>>>> *********************************************************************************
>>>>>>>
>>>>>>> If anyone has any advice on why my script might be crashing, it
>>>>>>> would be greatly appreciated. I'm chasing down memory allocation
>>>>>>> (there should be a max of 4 GB per node), although I may be way
>>>>>>> off...
>>>>>>>
>>>>>>> CPU time : 1882.70 sec.
>>>>>>> Max Memory : 3256 MB
>>>>>>> Max Swap : 11190 MB
>>>>>>>
>>>>>>> Max Processes : 10
>>>>>>> Max Threads : 12
>>>>>>>
>>>>>>> Also attached is the text outfile (ncl.out) from my last analysis;
>>>>>>> they're all more or less the same (except for the length of the
>>>>>>> analysis). As shown in the outfile, I'm using NCL v5.2.0.
>>>>>>>
>>>>>>> Of course, any general critiques on how to make this NCL script more
>>>>>>> efficient would also be greatly appreciated.
>>>>>>>
>>>>>>> Please let me know if I neglected to mention anything important.
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -Joe
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Joseph B. Zambon
>>>>>>> jbzambon@ncsu.edu <mailto:jbzambon@ncsu.edu>
>>>>>>> NC State University
>>>>>>> Department of Marine, Earth and Atmospheric Sciences
>>>>>>>
>>>>>>>
>>>>>>> <ncl_out.ncl.txt>
>>>>>>>
>>>>>>> <ncl.out>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ncl-talk mailing list
>>>>>>> List instructions, subscriber options, unsubscribe:
>>>>>>> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Received on Tue Jan 25 14:19:46 2011
