From: Kevin Vermeesch <kevin.c.vermeesch_at_nyahnyahspammersnyahnyah>

Date: Tue May 14 2013 - 09:37:12 MDT

Hi Brian,

I haven't worked with the ARM profiler data you attached in the original

message, but I've spent a fair amount of time trying to do this same

thing with other wind data sets. This goes into the area of robust data

analysis techniques where you try to calculate statistics with methods

that identify and ignore bad data points. I don't have any of these

methods coded in NCL (and even if I did my employer would frown upon me

sending code to people), but I can describe what I used to filter out

bad points.

Here are some references that may contain helpful information and that I

have used:

Hoaglin, D. C., F. Mosteller, and J. W. Tukey, 1983: Understanding

Robust and Exploratory Data Analysis. Wiley.

(specifically, I used chapter 2 (I think) titled "Letter values: a set

of selected order statistics." on pages 33-57)

Lanzante, J. R., 1996: Resistant, robust, and non-parametric techniques

for the analysis of climate data: theory and examples, including

applications to historical radiosonde station data. International

Journal of Climatology, 16, 1197-1226.

http://pubs.usgs.gov/twri/twri4a3/pdf/twri4a3-new.pdf

(I used Section 10.4 Iteratively Weighted Least Squares on printed page 283)

http://www.mathworks.com/help/stats/robustfit.html

(The MATLAB Statistics Toolbox has a function (robustfit) that is

essentially the iteratively weighted least squares from the USGS link above)
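For illustration, here is a minimal Python/NumPy sketch of an iteratively reweighted least squares fit in the spirit of robustfit; note this is my own hedged reconstruction, not the author's code, and the Huber weight function with tuning constant k = 1.345 is an assumption for simplicity (MATLAB's robustfit defaults to bisquare weights):

```python
# Hedged sketch of iteratively reweighted least squares (IRLS) for a
# robust linear fit; Huber weights are assumed here for simplicity.
import numpy as np

def irls_fit(x, y, n_iter=25, k=1.345):
    """Robust fit of y ~ coef[0] + coef[1]*x using Huber weights."""
    X = np.column_stack([np.ones_like(x), x])
    w = np.ones_like(y, dtype=float)
    coef = np.zeros(2)
    for _ in range(n_iter):
        sw = np.sqrt(w)
        # weighted least-squares solve with the current weights
        coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        resid = y - X @ coef
        s = np.median(np.abs(resid)) / 0.6745   # robust scale (MAD)
        if s == 0.0:
            break
        u = np.abs(resid) / (k * s)
        w = np.where(u < 1.0, 1.0, 1.0 / u)     # Huber weight function
    return coef

# synthetic wind-speed-vs-height profile with two gross outliers
rng = np.random.default_rng(1)
height = np.linspace(0.0, 5.0, 40)
speed = 5.0 + 2.0 * height + rng.normal(0.0, 0.3, height.size)
speed[[8, 30]] += 50.0
coef = irls_fit(height, speed)
print(coef)  # intercept and slope, close to (5, 2) despite the outliers
```

Each pass downweights points with large residuals, so the final line is pulled toward the bulk of the data rather than the outliers.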

Using the USGS and Mathworks links, I wrote my own iteratively weighted

least squares routine (i.e. my own version of robustfit in the Mathworks

link). For the profiler data, I would calculate a robust linear

regression of say wind speed magnitude vs. height. Next, assuming the

regression line was minimally influenced by outliers (that can be

confirmed by plotting it with the data points), I would calculate the

distance of each point from the regression line. Then, having a

distribution consisting of distances of points from the regression line,

I would use the technique in section 2C of the Hoaglin reference to

identify outliers and exclude them from further calculations. Instead of

using the technique from Hoaglin, you could also calculate the

mean/median of the distances and flag as outliers any points more than a chosen number of standard

deviations away from the regression line. This may not get all the bad

points and it could remove some good/valid data, but I've had success

using it.
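As a sketch of that residual-screening step, here is a resistant-fence rule in the spirit of the Hoaglin et al. letter-value approach; the 1.5 x IQR multiplier is the conventional Tukey choice, assumed here for illustration rather than taken from the reference:

```python
# Hedged sketch: flag outliers in a distribution of distances from the
# regression line using resistant fences (quartiles +/- 1.5*IQR).
import numpy as np

def fence_outliers(dist, k=1.5):
    """Flag values outside [q1 - k*iqr, q3 + k*iqr]."""
    q1, q3 = np.percentile(dist, [25, 75])
    iqr = q3 - q1
    return (dist < q1 - k * iqr) | (dist > q3 + k * iqr)

# example distances of points from a fitted line; two are clearly bad
dist = np.array([0.1, -0.3, 0.2, 0.05, -0.1, 12.0, 0.15, -0.2, 9.5])
print(np.where(fence_outliers(dist))[0])  # indices of the two gross outliers
```

Because quartiles are barely affected by a few wild values, the fences stay tight around the good data, which is what makes this more resistant than a mean-plus-standard-deviation cutoff.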

What I've also done is take one profile at a time and see if there are

large differences between height levels or a layer of levels to find

where bad data might exist (if the speed difference between adjacent

levels is above some user-defined threshold), essentially "walking" up

the profile. As a last resort, I'll use a hard-coded threshold, but only

after these other methods have failed. I understand the frustration of trying to

remove bad data, so hopefully this is enough to get you started.
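A minimal sketch of that level-to-level check follows; the 10 m/s jump threshold and the -9999 missing value are placeholders to be tuned to your data, not values from the original message:

```python
# Hedged sketch of "walking" up a single wind-speed profile: flag a level
# when its jump from the last good level exceeds a user-chosen threshold.
import numpy as np

def flag_jumps(speed, missing=-9999.0, max_jump=10.0):
    """Return a boolean mask of suspect levels in one wind-speed profile."""
    speed = np.asarray(speed, dtype=float)
    bad = np.zeros(speed.shape, dtype=bool)
    last_good = None
    for i, s in enumerate(speed):
        if s == missing:
            continue                 # skip missing levels entirely
        if last_good is not None and abs(s - last_good) > max_jump:
            bad[i] = True            # big jump from the last good level
        else:
            last_good = s            # walk upward, carrying the last good value
    return bad

profile = [3.0, 4.5, 5.0, 42.0, 6.2, -9999.0, 7.1]
print(flag_jumps(profile))  # only the 42.0 level is flagged
```

Carrying the last good value (rather than always comparing adjacent levels) keeps a single bad level from also condemning the valid level just above it.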

Kevin

On 5/13/2013 11:24 AM, brianjs@iastate.edu wrote:

> Good Afternoon,
>
> While I have removed the missing data (listed as 999999 in the
> datasets) and replaced it with -9999 to have it removed from plotting,
> there is still erroneous data (as can be seen in the first 2 plots
> listed). Most of these, of course, are illustrated in dark red.
> Comparing these plots to the ones posted on the ARM data sites for
> quick viewing (attachments 3 and 4, the bottom image with the low
> power settings), these faulty data are left in. To avoid confusion,
> please note that my plots reduce the windbarb density somewhat and
> incorporate 36-hour observations, concatenated, with x-axis labeling
> every 6 hours and data plotted every 3 hours.
>
> I, however, want to remove these random high wind barbs, and I have
> attempted using the 'where' command to filter some of these points.
> The idea with this command is to filter out the missing data and
> exclude the bad points at the same time. As such, I had set a
> threshold for exclusion of data for the wnd_new variable, where I am
> essentially excluding the magnitude of the wind at a certain point. I
> did it this way instead of setting thresholds for the u and v
> components because lower-magnitude u and v components (both negative
> and positive) must be excluded in order to nail the bad data points.
> I have attached the code (attachment 5) to show specifically what I
> am doing. The problem with the u, v component threshold approach is
> that it wipes out much of my good data as well. As can be seen in
> image #5 (attachment 6), the wnd magnitude exclusion approach does
> not thoroughly remove the bad data. It essentially just removes the
> color coding.
>
> Would anyone happen to know a more rigorous and effective way of
> excluding these seemingly faulty data points? I know they can be
> hard-coded out, but that seems to be the least practical method for
> handling this issue. Any help would be greatly appreciated.
>
> Lastly, feel free to also send me a reply directly to my email in
> addition to ncl-talk, as I do not seem to receive replies in my inbox
> (usually someone in the office also on ncl-talk fills me in on the
> email replies). Thank you again!
>
> My email (in case it does not show) is brianjs@iastate.edu
>
> Brian Squitieri
> Graduate Assistant
> Iowa State University

--
Kevin Vermeesch
Science Systems and Applications, Inc.
NASA/GSFC Code 612
Building 33, Room C422
e-mail: kevin.c.vermeesch@nasa.gov

_______________________________________________

ncl-talk mailing list

List instructions, subscriber options, unsubscribe:

http://mailman.ucar.edu/mailman/listinfo/ncl-talk

Received on Tue May 14 09:37:25 2013

This archive was generated by hypermail 2.1.8 : Wed May 15 2013 - 10:19:28 MDT