Re: A problem with the Spearman rank correlation function (spcorr)

From: Dennis Shea <shea_at_nyahnyahspammersnyahnyah>
Date: Tue Feb 22 2011 - 09:55:29 MST

Hi Daran

http://www.answers.com/topic/spearman-s-rank-correlation-coefficient

If there are no tied ranks, then ρ is given by:[1][2]

     \rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}.

If tied ranks exist, Pearson's correlation coefficient between ranks
should be used for the calculation[1]:

     r = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2 \sum_i (y_i-\bar{y})^2}}.

Equal values must all be assigned the same rank: the average of the
positions they occupy when the values are sorted in ascending order.
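
To make the averaged-rank rule concrete, here is a minimal Python sketch
(standard library only). It is not taken from NCL, R, or SAS; the function
names average_ranks, pearson, and spearman_tie_aware are made up for this
illustration.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    # Indices of the values in ascending order (Python's sort is stable).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_position = (start + end) / 2.0 + 1.0  # positions are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_position
        start = end + 1
    return ranks

def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    # Undefined (division by zero) if either sequence is constant.
    return cov / sqrt(var_a * var_b)

def spearman_tie_aware(x, y):
    """Spearman correlation as Pearson's r computed on averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

print(average_ranks([0.0, 0.3, 0.0, 0.1, 0.0]))  # [2.0, 5.0, 2.0, 4.0, 2.0]
```

In the printed example the three zeros occupy sorted positions 1 to 3, so
each receives rank 2.0.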

===============
http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect014.htm

This [SAS] also notes "PROC CORR computes the Spearman correlation by
ranking the data and using the ranks in the Pearson product-moment
correlation formula. In case of ties, the averaged ranks are used."

I do not see any averaging of ranks in the Fortran code used
by NCL. Hence, I *speculate* that the R result takes
"averaged ranks" into account. Do you have the source code
used by R?
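
As a rough illustration of why tie handling could matter here, the Python
sketch below compares two calculations on a small, zero-heavy sample. The
data are synthetic, and the helper names are hypothetical. I have not
verified what the NCL routine actually does with ties; the "naive" branch
simply assumes ties are broken by input order and the no-ties formula is
applied, which is only one possible behaviour.

```python
from math import sqrt

def ordinal_ranks(values):
    """1-based ranks with ties broken by input order (no averaging)."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # stable sort
    ranks = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = float(position)
    return ranks

def average_ranks(values):
    """1-based ranks where tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks

def rho_shortcut(rank_x, rank_y):
    """rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)); exact only when there are no ties."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

# Synthetic sample (not the Cabauw/MERRA data): mostly zeros, as in the thread.
x = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.5]
y = [0.0, 0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.1]

naive = rho_shortcut(ordinal_ranks(x), ordinal_ranks(y))
tie_aware = pearson(average_ranks(x), average_ranks(y))
print(naive, tie_aware)
```

On this sample the naive value is much larger than the tie-aware one, which
is at least in the same direction as the NCL versus R discrepancy.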

D

On 2/21/11 1:45 PM, Daran Rife wrote:
> Hi Dennis,
>
> Thanks for the additional information and results. Emilie has gone on
> maternity leave so you're not likely to hear from her for a while. We
> have been working this issue together. The R code to calculate the
> Spearman rank correlation looks like this:
>
> x<- read.table("data_Cabauw_MERRA_wspd-iqrln_full1.dat")
> y<- read.table("data_Cabauw_MERRA_wspd-iqrln_hist_wspd-iqrln-12-1.dat")
> spear<- cor(x$V1,y$V1, method="spearman")
> print(spear)
> [1] 0.07144465
> spear^2
> [1] 0.005104339
>
> I have the same suspicion about how the two methods deal with ties when
> ranking the data before computing the correlation. Our data have many,
> many non-zero values whose frequency is identical. Thus, there are
> many ties when the frequencies are ranked. Like you, I see nothing in
> NCL's Fortran-based routine that indicates ties are an issue.
>
> The R documentation states "Spearman's rho statistic is used to
> estimate a rank-based measure of association. These are more robust
> and have been recommended if the data do not necessarily come from a
> bivariate normal distribution."
>
> I also emulated your experiment by removing the zero elements.
>
> ind<- which(x != 0 & y != 0)
> spear_nonzero<- cor(x$V1[ind], y$V1[ind], method="spearman")
> Warning message:
> In cor(x$V1[ind], y$V1[ind], method = "spearman") :
> the standard deviation is zero
>
> The correlation is NA because x$V1[ind] and y$V1[ind] are each
> constant (and identical), so the standard deviation is zero and the
> correlation is undefined. This can be readily seen by looking at
> x$V1[ind] and y$V1[ind].
>
> > x$V1[ind]
> [1] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
> [7] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
> > y$V1[ind]
> [1] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
> [7] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
>
> This, again, is very different from what NCL's spcorr function returns.
> I assume you intended to remove all zero values, but your statement
> doesn't appear to do this.
>
> i = ind( .not.(x.eq.0 .and. y.eq.0) )
>
> print(y(i))
>
> (0) 0
> (1) 0
> (2) 0
> (3) 0.03042288
> (4) 0.03042288
> (5) 0
> (6) 0
> (7) 0
> (8) 0
> (9) 0
> (10) 0
> (11) 0
> (12) 0
> (13) 0
> (14) 0
> (15) 0
> (16) 0
> (17) 0
> .
> .
> .
>
> I tried to modify your statement to select only non-zero elements from
> both x and y but I get the following error, which clearly indicates
> that there are a different number of non-zero elements in x and y.
>
> i = ind(x.ne.0 .and. y.ne.0)
> fatal:Dimension sizes of left hand side and right hand side of
> assignment do not match
> fatal:Execute: Error occurred at or near line 40
>
> Because I am not an NCL expert, I don't know how to compute the
> intersection of the non-zero element indices in x and y. In any case,
> thanks for looking into this further.
>
>
> Sincerely,
>
>
> Daran
> --
>
> Message: 1
> Date: Sun, 20 Feb 2011 14:05:09 -0700
> From: Dennis Shea<shea@ucar.edu>
> Subject: Re: [ncl-talk] A problem with the Spearman rank correlation
> function (spcorr)
> To: Emilie Vanvyve<evanvyve@ucar.edu>
> Cc: ncl-talk@ucar.edu
> Message-ID:<4D618205.2030200@ucar.edu>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> x = asciiread("data_Cabauw_MERRA_wspd-iqrln_full1.dat",-1,"float")
> y = asciiread("data_Cabauw_MERRA_wspd-iqrln_hist_wspd-iqrln12-1.dat",-1,
> r = spcorr(x,y) ; r = 0.6188006 ; r*r=0.3829142
>
> Just for 'fun', I also tried
>
> i = ind( .not.(x.eq.0 .and. y.eq.0) )
> R = spcorr(x(i),y(i)) ; R = 0.3943804 ; R*R = 0.1555359
> ====
>
> I speculate it must have something to do with the
> many ties [ x(n)=y(n)=0 ]. There is nothing in
> the fortran code to indicate ties are an issue.
>
> ====
> The R code shows no rank correlation. Is that what you expect?
> _______________________________________________
> ncl-talk mailing list
> List instructions, subscriber options, unsubscribe:
> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
Received on Tue Feb 22 09:55:36 2011