Re: A problem with the Spearman rank correlation function (spcorr)

From: Daran Rife <drife_at_nyahnyahspammersnyahnyah> Date: Tue Feb 22 2011 - 10:49:03 MST · This archive was generated by hypermail 2.1.8 : Wed Feb 23 2011 - 16:47:57 MST

--
On Feb 22, 2011, at 9:55 AM, Dennis Shea wrote:
> 
> Hi Darren
> 
> 
> http://www.answers.com/topic/spearman-s-rank-correlation-coefficient
> 
> If there are no tied ranks, then ρ is given by:[1][2]
> 
>    \rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}.
> 
> If tied ranks exist, Pearson's correlation coefficient between ranks should be used for the calculation[1]:
> 
>    r = \frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2 \sum_i(y_i-\bar{y})^2}}.
> 
> One has to assign the same rank to each of the equal values. It is an average of their positions in the ascending order of the values.
> 
> 
> Read more: http://www.answers.com/topic/spearman-s-rank-correlation-coefficient#ixzz1EhqkWAmX
> 
> ===============
> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect014.htm
> 
> This [SAS] also notes "PROC CORR computes the Spearman correlation by ranking the data and using the ranks in the Pearson product-moment correlation formula. In case of ties, the averaged ranks are used."
> 
> I do not see any averaging of ranks in the fortran code used
> by NCL. Hence, I *speculate* the R result might take into
> account "averaged ranks". Do u have the source code
> used by "R" ?
> 
> D
> 
> 
> 
> On 2/21/11 1:45 PM, Daran Rife wrote:
>> Hi Dennis,
>> 
>> Thanks for the additional information and results. Emilie has gone on
>> maternity leave so you're not likely to hear from her for a while. We
>> have been working this issue together. The R code to calculate the
>> Spearman rank correlation looks like this:
>> 
>> x<- read.table("data_Cabauw_MERRA_wspd-iqrln_full1.dat")
>> y<- read.table("data_Cabauw_MERRA_wspd-iqrln_hist_wspd-iqrln-12-1.dat")
>> spear<- cor(x$V1,y$V1, method="spearman)
>> print(spear)
>> [1] 0.07144465
>> spear^2
>> [1] 0.005104339
>> 
>> I have the same suspicion about how the two methods deal with ties when
>> ranking the data before computing the correlation. Our data have many,
>> many non-zero values whose frequency is the identical. Thus, there are
>> many ties when the frequencies are ranked. Like you, I see nothing in
>> NCL's Fortran-based routine that indicates ties are an issue.
>> 
>> The R documentation states "Spearman's rho statistic is used to
>> estimate a
>> rank-based measure of association. These are more robust and have been
>> recommended if the data do not necessarily come from a bivariate normal
>> distribution."
>> 
>> And I emulated your experiment by removing the non-zero elements.
>> 
>> ind<- which(x != 0&  y != 0)
>> spear_nonzero<- cor(x$V1[ind], y$V1[ind], method="spearman")
>> Warning message:
>> In cor(x$V1[ind], y$V1[ind], method = "spearman") :
>>    the standard deviation is zero
>> 
>> The correlation is NA, because x$V1[ind] and y$V1[ind] are perfectly
>> correlated, which can be readily seen by looking at x$V1[ind] and
>> y$V1[ind].
>> 
>>  >  x$V1[ind]
>>   [1] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
>>   [7] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
>>  >  y$V1[ind]
>>   [1] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
>>   [7] 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288 0.03042288
>> 
>> This again, is very different that what NCL's spcorr function returns.
>> I assume you intended to remove all non-zero values, but your function
>> doesn't appear to do this.
>> 
>> i = ind( .not.(x.eq.0 .and. y.eq.0) )
>> 
>> print(y(i))
>> 
>> (0)      0
>> (1)      0
>> (2)      0
>> (3)     0.03042288
>> (4)     0.03042288
>> (5)      0
>> (6)      0
>> (7)      0
>> (8)      0
>> (9)      0
>> (10)     0
>> (11)     0
>> (12)     0
>> (13)     0
>> (14)     0
>> (15)     0
>> (16)     0
>> (17)     0
>> .
>> .
>> .
>> 
>> I tried to modify your statement to select only non-zero elements from
>> both x and y but I get the following error, which clearly indicates
>> that there are a different number of non-zero elements in x and y.
>> 
>> i = ind(x.ne.0 .and. y.ne.0)
>> fatal:Dimension sizes of left hand side and right hand side of
>> assignment do not match
>> fatal:Execute: Error occurred at or near line 40
>> 
>> Because I am not an NCL expert, I don't know how to compute the inter-
>> section between the non-zero element indices in x and y. In any case,
>> thanks for looking into this further.
>> 
>> 
>> Sincerely,
>> 
>> 
>> Daran
>> --
>> 
>> Message: 1
>> Date: Sun, 20 Feb 2011 14:05:09 -0700
>> From: Dennis Shea<shea@ucar.edu>
>> Subject: Re: [ncl-talk] A problem with the Spearman rank correlation
>> 	function (spcorr)
>> To: Emilie Vanvyve<evanvyve@ucar.edu>
>> Cc: ncl-talk@ucar.edu
>> Message-ID:<4D618205.2030200@ucar.edu>
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>> 
>> x = asciiread("data_Cabauw_MERRA_wspd-iqrln_full1.dat",-1,"float")
>> y = asciiread("data_Cabauw_MERRA_wspd-iqrln_hist_wspd-iqrln12-1.dat",-1,
>> r = spcorr(x,y)  ; r = 0.6188006    ; r*r=0.3829142
>> 
>> Just for 'fun', I also tried
>> 
>> i = ind( .not.(x.eq.0 .and. y.eq.0) )
>> R = spcorr(x(i),y(i))   ; R = 0.3943804   ; R*R = 0.1555359
>> ====
>> 
>> I speculate it must have something to do with the
>> many ties [ x(n)=y(n)=0 ]. There is nothing in
>> the fortran code to indicate ties are an issue.
>> 
>> ====
>> The R code shows no rank correlation. Is that what you expect?
>> _______________________________________________
>> ncl-talk mailing list
>> List instructions, subscriber options, unsubscribe:
>> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
_______________________________________________
ncl-talk mailing list
List instructions, subscriber options, unsubscribe:
http://mailman.ucar.edu/mailman/listinfo/ncl-talk