Re: A problem with the Spearman rank correlation function (spcorr)

From: Daran Rife <drife_at_nyahnyahspammersnyahnyah>
Date: Tue Feb 22 2011 - 10:49:03 MST

Hi Dennis,

I'll spare you the rather lengthy R source code for the cor() function
(there is a heap of error checking), but I can tell you that when the
cor(x,y,method="spearman") is called, it calls another intrinsic R
function named "rank" to rank the values. By default, rank computes the
average "position" when there are ties in the data, and cor() uses the
default behavior of rank, so that must be the difference between the two
methods.

Sorry I forgot to include details about the rho metric, but I see you
found it. This is the most commonly used technique by statisticians.

 \rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}

Thanks again for helping us sort out the differences. NCL is definitely
an integral part of our tool suite, as is R, Python, NCO, and CDO, so
its helpful to understand where the various tools might differ.

Appreciate your attention to detail and your sticking with this unusual
request for help.

Daran

--
On Feb 22, 2011, at 9:55 AM, Dennis Shea wrote:
> 
> Hi Darren
> 
> 
> http://www.answers.com/topic/spearman-s-rank-correlation-coefficient
> 
> If there are no tied ranks, then ρ is given by:[1][2]
> 
>    \rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}.
> 
> If tied ranks exist, Pearson's correlation coefficient between ranks should be used for the calculation[1]:
> 
>    r = \frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2 \sum_i(y_i-\bar{y})^2}}.
> 
> One has to assign the same rank to each of the equal values. It is an average of their positions in the ascending order of the values.
> 
> 
> Read more: http://www.answers.com/topic/spearman-s-rank-correlation-coefficient#ixzz1EhqkWAmX
> 
> ===============
> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect014.htm
> 
> This [SAS] also notes "PROC CORR computes the Spearman correlation by ranking the data and using the ranks in the Pearson product-moment correlation formula. In case of ties, the averaged ranks are used."
> 
> I do not see any averaging of ranks in the fortran code used
> by NCL. Hence, I *speculate* the R result might take into
> account "averaged ranks". Do u have the source code
> used by "R" ?
> 
> D
> 
> 
> 
> On 2/21/11 1:45 PM, Daran Rife wrote:
>> Hi Dennis,
>> 
>> Thanks for the additional information and results. Emilie has gone on
>> maternity leave so you're not likely to hear from her for a while. We
>> have been working this issue together. The R code to calculate the
>> Spearman rank correlation looks like this:
>> 
>> x<- read.table("data_Cabauw_MERRA_wspd-iqrln_full1.dat")
>> y<- read.table("data_Cabauw_MERRA_wspd-iqrln_hist_wspd-iqrln-12-1.dat")
>> spear<- cor(x$V1,y$V1, method="spearman)
>> print(spear)
>> [1] 0.07144465
>> spear^2
>> [1] 0.005104339
>> 
>> I have the same suspicion about how the two methods deal with ties when
>> ranking the data before computing the correlation. Our data have many,
>> many non-zero values whose frequency is the identical. Thus, there are
>> many ties when the frequencies are ranked. Like you, I see nothing in
>> NCL's Fortran-based routine that indicates ties are an issue.
>> 
>> The R documentation states "Spearman's rho statistic is used to
>> estimate a
>> rank-based measure of association. These are more robust and have been
>> recommended if the data do not necessarily come from a bivariate normal
>> distribution."
>> 
>> And I emulated your experiment by removing the non-zero elements.
>> 
>> ind<- which(x != 0&  y != 0)
>> spear_nonzero<- cor(x$V1[ind], y$V1[ind], method="spearman")
>> Warning message:
>> In cor(x$V1[ind], y$V1[ind], method = "spearman") :
>>    the standard deviation is zero
>> 
>> The correlation is NA, because x$V1[ind] and y$V1[ind] are perfectly
>> correlated, which can be readily seen by looking at x$V1[ind] and
>> y$V1[ind].
>> 
>>  >  x$V1[ind]
>>   [1] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
>>   [7] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
>>  >  y$V1[ind]
>>   [1] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
>>   [7] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
>> 
>> This again, is very different that what NCL's spcorr function returns.
>> I assume you intended to remove all non-zero values, but your function
>> doesn't appear to do this.
>> 
>> i = ind( .not.(x.eq.0 .and. y.eq.0) )
>> 
>> print(y(i))
>> 
>> (0)      0
>> (1)      0
>> (2)      0
>> (3)     0.03042288
>> (4)     0.03042288
>> (5)      0
>> (6)      0
>> (7)      0
>> (8)      0
>> (9)      0
>> (10)     0
>> (11)     0
>> (12)     0
>> (13)     0
>> (14)     0
>> (15)     0
>> (16)     0
>> (17)     0
>> .
>> .
>> .
>> 
>> I tried to modify your statement to select only non-zero elements from
>> both x and y but I get the following error, which clearly indicates
>> that there are a different number of non-zero elements in x and y.
>> 
>> i = ind(x.ne.0 .and. y.ne.0)
>> fatal:Dimension sizes of left hand side and right hand side of
>> assignment do not match
>> fatal:Execute: Error occurred at or near line 40
>> 
>> Because I am not an NCL expert, I don't know how to compute the inter-
>> section between the non-zero element indices in x and y. In any case,
>> thanks for looking into this further.
>> 
>> 
>> Sincerely,
>> 
>> 
>> Daran
>> --
>> 
>> Message: 1
>> Date: Sun, 20 Feb 2011 14:05:09 -0700
>> From: Dennis Shea<shea@ucar.edu>
>> Subject: Re: [ncl-talk] A problem with the Spearman rank correlation
>> 	function (spcorr)
>> To: Emilie Vanvyve<evanvyve@ucar.edu>
>> Cc: ncl-talk@ucar.edu
>> Message-ID:<4D618205.2030200@ucar.edu>
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>> 
>> x = asciiread("data_Cabauw_MERRA_wspd-iqrln_full1.dat",-1,"float")
>> y = asciiread("data_Cabauw_MERRA_wspd-iqrln_hist_wspd-iqrln12-1.dat",-1,
>> r = spcorr(x,y)  ; r = 0.6188006    ; r*r=0.3829142
>> 
>> Just for 'fun', I also tried
>> 
>> i = ind( .not.(x.eq.0 .and. y.eq.0) )
>> R = spcorr(x(i),y(i))   ; R = 0.3943804   ; R*R = 0.1555359
>> ====
>> 
>> I speculate it must have something to do with the
>> many ties [ x(n)=y(n)=0 ]. There is nothing in
>> the fortran code to indicate ties are an issue.
>> 
>> ====
>> The R code shows no rank correlation. Is that what you expect?
>> _______________________________________________
>> ncl-talk mailing list
>> List instructions, subscriber options, unsubscribe:
>> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
_______________________________________________
ncl-talk mailing list
List instructions, subscriber options, unsubscribe:
http://mailman.ucar.edu/mailman/listinfo/ncl-talk
Received on Tue Feb 22 10:49:09 2011

This archive was generated by hypermail 2.1.8 : Wed Feb 23 2011 - 16:47:57 MST