Re: Script crashes while computing WRF diagnostics (SLP, MDBZ, TD2)

From: Joseph Zambon <jbzambon_at_nyahnyahspammersnyahnyah>
Date: Mon Jan 24 2011 - 13:35:23 MST

Mary,

I put some test files in /incoming/jbzambon...
jbzambon.nc:
The wrfout I was working with had 673 frames (313 GB), so I used ncks to reduce it to 2 frames, which got the file under 1 GB.
ncks wrfout_d01_2010-02-01_00:00:00 -d Time,1,2 jbzambon.nc

jbzambon.ncl:
As the wrfout now has only 2 frames, I modified my script a little, but I am able to reproduce the issue by simply looping over the first frame repeatedly. The script ran to iteration 223 before dying. I commented out the original loop and replaced it with something that simply runs the diagnostics on the first frame of data 673 times. I attached a diff in case this is confusing...

There is another file ncl_out_v4.ncl. I uploaded this before making enough changes and testing, and did not have permission to replace or delete. Please ignore/delete this file.

Thank you for providing me the location of the wrf_user.f. I will take a look at that routine to see if I can compile/run it outside of NCL and see if maybe there is a memory issue with that routine.

[jbzambon@login03 OUT]$ ncl -V
5.2.0
[jbzambon@login03 OUT]$ uname -a
Linux login03 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Thanks again!

-Joe

Gus,

I was unable to find anything in the code that was printing messages and taking memory... I was also unable to find a discussion of that in the ncl-talk archives. If you have more information about that, it would be greatly appreciated.

Thanks for your $0.02!

-Joe

On Jan 24, 2011, at 10:48 AM, Mary Haley wrote:

>
> On Jan 21, 2011, at 4:51 PM, Gus Correa wrote:
>
>> Hi Mary and Joe (and Dennis, if he's around)
>>
>> Some time ago I suffered from a similar memory failure/leak,
>> where my NCL scripts would progressively slow down,
>> and take up all memory.
>> This was with NCL 5.1.1.
>>
>> Back then the problem was traced to printing messages
>> from inside a loop.
>> Dennis pointed out that memory allocated to hold the message strings
>> was not being freed by NCL (if I remember right).
>> Mary and Dennis may remember this.
>>
>> Once the print commands were commented out, the problem disappeared.
>>
>> Could this be Joe's problem?
>
> It's certainly possible. It depends on the nature of the strings being used, and how
> many of them there are. This situation was improved a little, but there is still
> an issue.
>
> If you are doing a lot of looping, and printing out strings inside the loop, this
> can quickly add up to a problem.
>
> --Mary
>
>>
>> My two cents,
>> Gus Correa
>>
>> Mary Haley wrote:
>>> Hi Joe,
>>>
>>> What version of NCL are you running?
>>>
>>> There could be some internal memory being grabbed that you won't see via
>>> list_vars. As you alluded to below,
>>> it is likely one of the wrf_xxx functions, as they have to internally
>>> create a copy of your input arrays to convert
>>> them to double precision (since they come in as float). Also, in order
>>> to calculate "slp", you have to grab roughly
>>> 6 variables off the file, all of which are multiple dimension arrays of
>>> size 30 x 515 x 380. This may start to
>>> add up considering you have to convert some of these to double precision.
>>>
>>> However, these arrays should be getting freed up after you calculate
>>> each quantity.
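
As a rough sanity check on that estimate, the per-time-step working set can be sketched in Python. The count of six inputs and the 30 x 515 x 380 shape come from the message; treating every input as copied to double is an assumption (an upper bound, since only "some" are converted), and the helper name `field_mib` is made up for illustration.

```python
# Field shape as quoted in the message: 30 x 515 x 380.
BOTTOM_TOP, SOUTH_NORTH, WEST_EAST = 30, 515, 380
MIB = 1024 ** 2  # bytes per mebibyte


def field_mib(bytes_per_value):
    """Size in MiB of one 3-D field with the given element width."""
    return BOTTOM_TOP * SOUTH_NORTH * WEST_EAST * bytes_per_value / MIB


N_INPUTS = 6  # "roughly 6 variables" per the message; an approximation

float_inputs = N_INPUTS * field_mib(4)   # inputs as read from the file
double_copies = N_INPUTS * field_mib(8)  # ASSUMPTION: every input is copied

print(f"one float field : {field_mib(4):7.1f} MiB")
print(f"six float inputs: {float_inputs:7.1f} MiB")
print(f"six double copies (upper bound): {double_copies:7.1f} MiB")
print(f"combined upper bound: {float_inputs + double_copies:7.1f} MiB")
```

If these buffers are released after each call, this figure is a per-iteration peak rather than something that should accumulate across iterations.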
>>>
>>> Would you be able to provide us with your latest script and data files
>>> so we can try running it here? You can use our ftp
>>> site (see http://www.ncl.ucar.edu/report_bug.shtml for ftp info), or put
>>> the files on your own site and we'll get them.
>>>
>>> The actual WRF computations are done in various Fortran files in
>>> $NCARG/ni/src/lib/nfpfort. The wrf_slp one is
>>> in wrf_user.f. I've attached the file here.
>>>
>>>
>>> --Mary
>>>
>>>
>>> On Jan 21, 2011, at 3:00 PM, Joseph Zambon wrote:
>>>
>>>> Mary,
>>>>
>>>> I'm not sure how NCL reserves memory, but I think there may be a
>>>> "leak" somewhere while the loop is iterating.
>>>>
>>>> When NCL starts, it reserves 5936kb of physical memory, 107mb of
>>>> virtual memory...
>>>> 6283 jbzambon 16 0 107m 5936 4060 S 0.0 0.1 0:00.02 ncl
>>>>
>>>> When I run the script to just before running the loop, it's using
>>>> 170mb/69mb (virtual/physical)...
>>>> 6283 jbzambon 15 0 170m 69m 5532 S 0.0 1.8 0:00.51 ncl
>>>>
>>>> When I execute the loop for 5 iterations, it's using 551mb/450mb (virtual/physical)...
>>>> 6283 jbzambon 25 0 551m 450m 5840 S 0.0 11.4 0:58.14 ncl
>>>>
>>>> When I run a list_vars command, I get the following variables...
>>>> ncl 52> list_vars
>>>> string in_file [ 1 ]
>>>> integer time [ 1 ]
>>>> string var_names_time [ 1 ]
>>>> integer south_north [ 1 ]
>>>> string maxtime [ 673 ]
>>>> _FillValue
>>>> description
>>>> integer DateStrLen [ 1 ]
>>>> integer bottom_top [ 1 ]
>>>> string dim_names [ 5 ]
>>>> integer west_east [ 1 ]
>>>> string dim_names_time [ 2 ]
>>>> logical dimUnlim [ 5 ]
>>>> integer b [ 3 ]
>>>> float T [ bottom_top | 30 ] x [ south_north | 515 ] x [ west_east | 380 ]
>>>> FieldType
>>>> MemoryOrder
>>>> description
>>>> units
>>>> stagger
>>>> coordinates
>>>> string dim_names_3D [ 3 ]
>>>> string out_file [ 1 ]
>>>> string var_types_3D [ 10 ]
>>>> string var_types_time [ 1 ]
>>>> string var_names_3D [ 10 ]
>>>> integer dim_sizes [ 5 ]
>>>> integer ntimes [ 1 ]
>>>>
>>>> The only substantial source of memory listed is T (30x515x380)...
>>>> 30x515x380 = 5,871,000 floats * 32bits/float =
>>>> 187,872,000bits / 8,388,608bits/Mb = 22.40Mb
>>>>
>>>> When I delete that float, I get the following memory usage: it's using
>>>> 528mb/428mb
>>>> 6283 jbzambon 15 0 528m 428m 5844 S 0.0 10.8 0:58.17 ncl
>>>> 551mb-528mb=23mb / 450mb-428mb=22mb, so this makes sense...
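
The comparisons above can be made less error-prone by converting the top fields to one unit before subtracting. This is a minimal Python sketch; it assumes procps-style top output, where a bare number is KiB and an "m" or "g" suffix scales by 1024. The function name `top_mem_to_mib` is hypothetical, and the sample values are copied from the top lines quoted in this message.

```python
def top_mem_to_mib(field):
    """Convert a top VIRT/RES field such as '5936', '450m' or '2.8g' to MiB.

    ASSUMPTION: procps-style top, where bare numbers are KiB.
    """
    scale = {"k": 1.0 / 1024, "m": 1.0, "g": 1024.0}
    suffix = field[-1].lower()
    if suffix in scale:
        return float(field[:-1]) * scale[suffix]
    return float(field) / 1024  # no suffix: value is in KiB


# (label, VIRT, RES) taken from the top lines in this message.
samples = [
    ("NCL started", "107m", "5936"),
    ("before loop", "170m", "69m"),
    ("after 5 iterations", "551m", "450m"),
    ("after deleting T", "528m", "428m"),
]

previous = None
for label, virt, res in samples:
    res_mib = top_mem_to_mib(res)
    change = "" if previous is None else f" ({res_mib - previous:+.1f} MiB)"
    print(f"{label:20s} VIRT {top_mem_to_mib(virt):7.1f}  RES {res_mib:7.1f}{change}")
    previous = res_mib
```

Printing the change in resident memory between samples makes it easier to see how much of the growth is explained by variables that list_vars reports.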
>>>>
>>>> Somewhere in the iteration of the loop though, memory is being
>>>> reserved that is not being reported by list_vars, so I can't delete
>>>> anything in order to reduce memory usage.
>>>>
>>>> I've tried using the last version of the code, as well as your
>>>> suggestion to use -1 instead of running a loop. I replaced my loop,
>>>> as you suggested with
>>>> ncf->slp = wrf_user_getvar(a,"slp",-1) ;SLP
>>>> When I monitor memory usage while using your suggestion, the amount
>>>> of reserved memory fluctuates between 1.6gb and 2.8gb...
>>>> 6351 jbzambon 18 0 2954m 714m 5496 R 50.0 18.1 0:04.54 ncl
>>>> 6351 jbzambon 18 0 2954m 1.6g 5496 D 51.5 41.0 0:13.29 ncl
>>>> 6351 jbzambon 18 0 2954m 1.9g 5496 D 61.8 49.9 0:16.73 ncl
>>>> 6351 jbzambon 18 0 2954m 2.3g 5496 R 63.3 59.5 0:20.48 ncl
>>>> 6351 jbzambon 18 0 2954m 2.5g 5496 R 67.7 63.8 0:22.52 ncl
>>>> 6351 jbzambon 18 0 2954m 2.6g 5496 D 63.4 68.3 0:24.43 ncl
>>>> 6351 jbzambon 17 0 2954m 2.2g 6760 D 10.3 58.0 0:30.15 ncl
>>>> 6351 jbzambon 16 0 2954m 2.8g 6760 D 9.0 71.8 0:33.23 ncl
>>>> 6351 jbzambon 16 0 2954m 2.8g 6760 D 9.0 71.8 0:33.23 ncl
>>>>
>>>> After about 35s of running, I get a segmentation fault....
>>>> Segmentation fault (core dumped)
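
One possible reading of this crash, offered as a back-of-the-envelope sketch rather than a diagnosis: if a call with -1 holds every time step of a 3-D input in memory at once (an assumption about how the call behaves, not something confirmed in this thread), the sizes work out as below. The dimensions, the 673 time steps, and the 4 GB per-node limit are taken from the messages; the helper name `gib` is made up.

```python
NTIMES, BOTTOM_TOP, SOUTH_NORTH, WEST_EAST = 673, 30, 515, 380
GIB = 1024 ** 3          # bytes per gibibyte
NODE_LIMIT_GIB = 4.0     # "max of 4Gb per node" from the original post


def gib(n_values, bytes_per_value=4):
    """Size in GiB of n_values elements (float by default)."""
    return n_values * bytes_per_value / GIB


# One 2-D diagnostic (e.g. slp) for all time steps, stored as float.
out_2d_all_times = gib(NTIMES * SOUTH_NORTH * WEST_EAST)

# One 3-D input (e.g. T) for all time steps, stored as float.
# ASSUMPTION: all time steps are resident in memory at the same time.
in_3d_all_times = gib(NTIMES * BOTTOM_TOP * SOUTH_NORTH * WEST_EAST)

fits = in_3d_all_times < NODE_LIMIT_GIB

print(f"2-D output, all times: {out_2d_all_times:6.2f} GiB")
print(f"3-D input, all times : {in_3d_all_times:6.2f} GiB")
print(f"single 3-D input fits in {NODE_LIMIT_GIB} GiB: {fits}")
```

Under that assumption a single 3-D input alone would exceed the node limit, so processing one time step (or a small batch) per call may be the safer pattern for a file of this size.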
>>>>
>>>> If I try to open the netCDF file to see on which frame it died, I am unable to...
>>>> ncdump -h /he_data/he/jbzambon/projects/feb10/OUT/ncl_out.nc
>>>> ncdump: /he_data/he/jbzambon/projects/feb10/OUT/ncl_out.nc: Not a
>>>> netCDF file
>>>>
>>>> I haven't yet worked at predefining the netCDF file; I'm guessing
>>>> that has to do with I/O efficiency and wouldn't be a cause of this issue.
>>>>
>>>> Also, where are the functions which WRFUserARW.ncl calls located? For
>>>> example:
>>>> tk = wrf_tk( P , T ) ; calculate TK
>>>> slp = wrf_slp( z, tk, P, QVAPOR ) ; calculate slp
>>>> I'm guessing it's somewhere in the ncl source as it's not in:
>>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/gsn_code.ncl"
>>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/wrf/WRFUserARW.ncl"
>>>> load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/contributed.ncl"
>>>> And I can't see anything in wrfW.c that does the actual computation.
>>>>
>>>> If you have any other ideas, please let me know. Thanks for your
>>>> suggestions and help!
>>>>
>>>> -Joe
>>>>
>>>>
>>>>
>>>>
>>>> On Jan 21, 2011, at 12:57 PM, Mary Haley wrote:
>>>>
>>>>> Hi Joseph,
>>>>>
>>>>> I'm guessing you are simply running out of memory. Once you are done
>>>>> with a do loop,
>>>>> you might try deleting any variables you don't need before you
>>>>> proceed to the next do loop.
>>>>>
>>>>> Also, have you tried getting rid of the "do" loops and using "-1" as
>>>>> the last argument to
>>>>> wrf_user_getvar? This allows you to read in all time steps for that
>>>>> variable without
>>>>> having to loop.
>>>>>
>>>>> What I'm thinking about may look like your third attempt, except
>>>>> without the do loop:
>>>>>
>>>>>> ncf->slp = wrf_user_getvar(a,"slp",-1) ;SLP
>>>>>> ncf->mdbz = wrf_user_getvar(a,"mdbz",-1) ;MDBZ
>>>>>> ncf->td2 = wrf_user_getvar(a,"td2",-1) ;TD2
>>>>>
>>>>> If you predefine the NetCDF file that you're writing to, this will
>>>>> help your code be far more efficient,
>>>>> especially if you are writing a lot of data to the file.
>>>>>
>>>>> Please see our examples page on this:
>>>>>
>>>>> http://www.ncl.ucar.edu/Applications/method_2.shtml
>>>>>
>>>>> Let me know if this doesn't help and we'll try something else.
>>>>>
>>>>> --Mary
>>>>>
>>>>> On Jan 20, 2011, at 2:35 PM, Joseph Zambon wrote:
>>>>>
>>>>>> NCL Users,
>>>>>>
>>>>>> I have a script that runs through a wrfout file and pulls out
>>>>>> diagnostics as well as some variables I need to use. The variables
>>>>>> are removed without a problem, however when I run this script for
>>>>>> larger domains, the diagnostics appear to crash the system. I've
>>>>>> been running this on a serial compute node as I've crashed the login
>>>>>> node a few times (to the aggravation of other users, I'm sure) while
>>>>>> trying to debug this.
>>>>>>
>>>>>> Attached is my script as it is right now (ncl_out.ncl). I've tried
>>>>>> a couple of different approaches to running the diagnostics for a
>>>>>> domain that is 379(w_e) x 514 (s_n) and 673 timesteps. Using the
>>>>>> diagnostic SLP and MDBZ as examples...
>>>>>>
>>>>>> Version 1:
>>>>>> Original use was to create an array (ar_3d), save SLP to array at
>>>>>> every time step, save array to netCDF file, then overwrite the array
>>>>>> for the next variable (MDBZ). This way I was only using one large
>>>>>> array (379x514x673) to pull out the variables. This script would
>>>>>> crash after ~120 timesteps (not sure exactly where as it doesn't
>>>>>> save the array until the end and would not print out the times right
>>>>>> before dying). This way was probably the least efficient use of memory.
>>>>>> *********************************************************************************
>>>>>> ;create diagnostics array
>>>>>> ar_3d = new( (/ ntimes, south_north, west_east /), float)
>>>>>>
>>>>>> ;Diagnostics
>>>>>> do time = 0,ntimes-1,1 ;SLP
>>>>>> slp = wrf_user_getvar(a,"slp",time) ; Sea level pressure in hPa
>>>>>> ar_3d(time,:,:)=slp
>>>>>> print(time)
>>>>>> end do
>>>>>> ncf->slp = ar_3d
>>>>>>
>>>>>> do time = 0,ntimes-1,1 ;MDBZ
>>>>>> mdbz = wrf_user_getvar(a,"mdbz",time) ; Simulated radar refl.
>>>>>> ar_3d(time,:,:)=mdbz
>>>>>> print(time)
>>>>>> end do
>>>>>> ncf->mdbz = ar_3d
>>>>>> *********************************************************************************
>>>>>>
>>>>>> Version 2:
>>>>>> I assumed that this array was getting too large and was choking my
>>>>>> memory. My next attempt was to save the output directly to the
>>>>>> netCDF file and not use an array in the middle. This script would
>>>>>> crash after 234 timesteps (the netCDF file wrote 234 frames and then
>>>>>> died while processing the 235th).
>>>>>> *********************************************************************************
>>>>>> ;compute diagnostics and export to netCDF file
>>>>>> do time = 0,ntimes-1,1 ;SLP
>>>>>> ncf->slp(time,:,:) = (/wrf_user_getvar(a,"slp",time)/)
>>>>>> print(time)
>>>>>> end do
>>>>>>
>>>>>> ;compute diagnostics and export to netCDF file
>>>>>> do time = 0,ntimes-1,1 ;MDBZ
>>>>>> ncf->mdbz(time,:,:) = (/wrf_user_getvar(a,"mdbz",time)/)
>>>>>> print(time)
>>>>>> end do
>>>>>> *********************************************************************************
>>>>>>
>>>>>> Version 3:
>>>>>> My last attempt was to bunch all of the diagnostics within the same
>>>>>> time loop. I'm not sure how efficient this is for I/O, it didn't
>>>>>> seem to take much longer and it saved all 3 diagnostic variables to
>>>>>> the netCDF file. This one stopped at 241 timesteps.
>>>>>> *********************************************************************************
>>>>>> ;compute diagnostics and export to netCDF file
>>>>>> do time = 0,ntimes-1,1
>>>>>> ncf->slp(time,:,:) = (/wrf_user_getvar(a,"slp",time)/) ;SLP
>>>>>> ncf->mdbz(time,:,:) = (/wrf_user_getvar(a,"mdbz",time)/) ;MDBZ
>>>>>> ncf->td2(time,:,:) = (/wrf_user_getvar(a,"td2",time)/) ;TD2
>>>>>> print(time)
>>>>>> end do
>>>>>> *********************************************************************************
>>>>>>
>>>>>> If anyone has any advice on why my script might be crashing, it
>>>>>> would be greatly appreciated. I'm chasing down the memory
>>>>>> allocation path (should be a max of 4Gb per node), although I may be way
>>>>>> off...
>>>>>>
>>>>>> CPU time : 1882.70 sec.
>>>>>> Max Memory : 3256 MB
>>>>>> Max Swap : 11190 MB
>>>>>>
>>>>>> Max Processes : 10
>>>>>> Max Threads : 12
>>>>>>
>>>>>> Also attached is the text outfile (ncl.out) from my last analysis,
>>>>>> they're all more or less the same (except the length of analysis).
>>>>>> As shown in the outfile, I'm using NCL v. 5.2.0.
>>>>>>
>>>>>> Of course, any general critiques on how to make this NCL script more
>>>>>> efficient would also be greatly appreciated.
>>>>>>
>>>>>> Please let me know if I neglected to mention anything important.
>>>>>> Thanks!
>>>>>>
>>>>>> -Joe
>>>>>>
>>>>>>
>>>>>>
>>>>>> Joseph B. Zambon
>>>>>> jbzambon@ncsu.edu
>>>>>> NC State University
>>>>>> Department of Marine, Earth and Atmospheric Sciences
>>>>>>
>>>>>>
>>>>>> <ncl_out.ncl.txt>
>>>>>>
>>>>>> <ncl.out>
>>>>>>
>>>>>> _______________________________________________
>>>>>> ncl-talk mailing list
>>>>>> List instructions, subscriber options, unsubscribe:
>>>>>> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
>>>>>
>>>>
>>>

Received on Mon Jan 24 13:35:34 2011

This archive was generated by hypermail 2.1.8 : Tue Jan 25 2011 - 14:22:15 MST