ncl-talk 2013 archive: Re: Filtering erroneous data from vertical profiles

From: Kevin Vermeesch <kevin.c.vermeesch_at_nyahnyahspammersnyahnyah>
Date: Tue May 14 2013 - 09:37:12 MDT

Hi Brian,
I haven't worked with the ARM profiler data you attached in the original
message, but I've spent a fair amount of time trying to do this same
thing with other wind data sets. This goes into the area of robust data
analysis techniques where you try to calculate statistics with methods
that identify and ignore bad data points. I don't have any of these
methods coded in NCL (and even if I did my employer would frown upon me
sending code to people), but I can describe what I used to filter out
bad points.

Here are some references that I have used and that may contain helpful
information:

Hoaglin, D. C., F. Mosteller, and J. W. Tukey, 1983: Understanding
Robust and Exploratory Data Analysis. Wiley.
(specifically, I used chapter 2 (I think) titled "Letter values: a set
of selected order statistics." on pages 33-57)

Lanzante, J. R., 1996: Resistant, robust, and non-parametric techniques
for the analysis of climate data: theory and examples, including
applications to historical radiosonde station data. International
Journal of Climatology, 16, 1197-1226.

http://pubs.usgs.gov/twri/twri4a3/pdf/twri4a3-new.pdf
(I used Section 10.4 Iteratively Weighted Least Squares on printed page 283)

http://www.mathworks.com/help/stats/robustfit.html
(The MATLAB Statistics Toolbox has a function (robustfit) that is
essentially the iteratively weighted least squares from the USGS link above)

Using the USGS and Mathworks links, I wrote my own iteratively weighted
least squares routine (i.e. my own version of robustfit in the Mathworks
link). For the profiler data, I would calculate a robust linear
regression of, say, wind speed magnitude vs. height. Next, assuming the
regression line was minimally influenced by outliers (that can be
confirmed by plotting it with the data points), I would calculate the
distance of each point from the regression line. Then, having a
distribution consisting of distances of points from the regression line,
I would use the technique in section 2C of the Hoaglin reference to
identify outliers and exclude them from further calculations. Instead of
using the technique from Hoaglin, you could also calculate the
mean/median of the distances and say that outliers are so many standard
deviations away from the regression line. This may not get all the bad
points and it could remove some good/valid data, but I've had success
using it.
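
For illustration, here is a minimal Python sketch of the steps described
above. It is not the original routine (which was not shared), and every
function name in it is hypothetical. It assumes a bisquare weight
function with tuning constant 4.685 and a residual scale of MAD/0.6745
for the iteratively reweighted fit, and it assumes the outlier rule is
the common "fences" rule: flag residuals more than 1.5 fourth-spreads
beyond the lower or upper fourth. Check those choices against the
references before relying on them.

```python
import statistics


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b


def robust_line_fit(x, y, tune=4.685, n_iter=50, tol=1e-8):
    """Iteratively reweighted least squares (assumed bisquare weights)."""
    a, b = weighted_line_fit(x, y, [1.0] * len(x))  # start from ordinary LS
    for _ in range(n_iter):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        med = statistics.median(resid)
        mad = statistics.median([abs(r - med) for r in resid])
        scale = mad / 0.6745  # robust estimate of the residual spread
        if scale == 0:
            break
        w = []
        for r in resid:
            u = r / (tune * scale)
            # Bisquare: weight falls smoothly to zero for large residuals.
            w.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
        a_new, b_new = weighted_line_fit(x, y, w)
        converged = abs(a_new - a) < tol and abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b


def fourths(values):
    """Lower and upper fourths: medians of the lower and upper halves."""
    s = sorted(values)
    n = len(s)
    half = (n + 1) // 2  # the median belongs to both halves when n is odd
    return statistics.median(s[:half]), statistics.median(s[n - half:])


def flag_outliers(x, y, k=1.5):
    """True where a point lies outside the fences around the robust line."""
    a, b = robust_line_fit(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    lower, upper = fourths(resid)
    spread = upper - lower
    return [r < lower - k * spread or r > upper + k * spread for r in resid]
```

Here x would be the heights and y the wind speed magnitudes of the
profiler data, with missing values removed first. The residuals are
vertical distances from the line (in speed units), which is one reading
of "distance of each point from the regression line".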

What I've also done is take one profile at a time and "walk" up it,
checking whether the speed difference between adjacent height levels
(or across a layer of levels) is above some user-defined threshold, to
find where bad data might exist. As a last resort, I'll use a hard-coded
threshold, but only
after these have failed. I understand the frustration of trying to
remove bad data, so hopefully this is enough to get you started.
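
As a rough illustration of the level-by-level check, here is a minimal
Python sketch. The function name and the max_jump parameter are
hypothetical, and comparing each level with the last accepted level
below it (rather than strictly the adjacent one) is an assumption made
so that one bad level does not also condemn the good level above it. The
-9999 missing value is the one mentioned in the quoted message.

```python
def flag_jumps(speeds, max_jump, missing=-9999.0):
    """Walk up one profile and flag levels that jump too far.

    speeds   -- wind speed magnitudes ordered from lowest to highest level
    max_jump -- user-defined threshold on the level-to-level difference
    Returns a list of booleans, True where a level looks suspect.
    """
    flags = [False] * len(speeds)
    last_good = None  # speed at the most recent accepted level
    for i, s in enumerate(speeds):
        if s == missing:
            continue  # already missing: skip it, do not compare against it
        if last_good is not None and abs(s - last_good) > max_jump:
            flags[i] = True  # too large a jump from the level below
            continue
        last_good = s
    return flags
```

One caveat with this sketch: it trusts the lowest non-missing level, so
if that level is bad, the good levels above it get flagged instead.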
Kevin

On 5/13/2013 11:24 AM, brianjs @iastate.edu wrote:
> Good Afternoon,
>
> While I have removed the missing data (listed as 999999 in the
> datasets) and replaced it with -9999 to have it removed from plotting,
> there is still erroneous data (as can be seen in the first 2 plots
> listed). Most of these, of course, are illustrated in dark red.
> Comparing these plots to the ones posted on the ARM data sites for
> quick viewing (attachments 3, and 4, the bottom image with the low
> power settings), these faulty data are left in. To avoid confusion,
> please note that my plots reduce the windbarb density somewhat and
> incorporate 36 hour observations, concatenated, with x-axis labeling
> every 6 hours, and data plotted every 3 hours.
>
> I, however, want to remove these random high wind barbs and I have
> attempted using the 'where' command to filter some of these points.
> The idea with this command is to filter out the missing data and
> exclude the bad points at the same time. As such, I had set a
> threshold for exclusion of data for the wnd_new variable, where I am
> essentially excluding the magnitude of the wind at a certain point. I
> did it this way instead of setting thresholds for the u and v
> components because lower in magnitude u and v components (both
> negative and positive) must be excluded in order to nail the bad data
> points. I have attached the code (attachment 5) to show specifically
> what I am doing. The problem with the u, v component threshold
> approach is that it wipes out much of my good data as well. As can be
> seen by image #5 (attachment 6), the wnd magnitude exclusion approach
> does not thoroughly remove the bad data. It just removes the color
> coding essentially.
>
> Would anyone happen to know a more rigorous and effective way of
> excluding these seemingly faulty data points? I know they can be hard
> coded out, but that seems to be the least practical method for
> handling this issue. Any help would be greatly appreciated.
>
> Lastly, feel free to also send me a reply directly to my email in
> addition to the ncl-talk as I do not seem to receive replies in my box
> (usually someone in the office also on NCL talk fills me in on the
> email replies). Thank you again!
>
> My email (in case it does not show) is brianjs@iastate.edu
> <mailto:brianjs@iastate.edu>
>
> Brian Squitieri
> Graduate Assistant
> Iowa State University

-- 
* * * * * * * * * * * *
Kevin Vermeesch
Science Systems and Applications, Inc.
NASA/GSFC Code 612
Building 33, Room C422
e-mail: kevin.c.vermeesch@nasa.gov

_______________________________________________
ncl-talk mailing list
List instructions, subscriber options, unsubscribe:
http://mailman.ucar.edu/mailman/listinfo/ncl-talk
Received on Tue May 14 09:37:25 2013
