kolsm2_n

Uses the Kolmogorov-Smirnov two-sample test to determine if two samples are from the same distribution.

Available in version 6.2.0 and later.

Prototype

	function kolsm2_n (
		x        : numeric,  
		y        : numeric,  
		dims [*] : integer   
	)

	return_val  :  float or double

Arguments

x
y

Arrays of any dimensionality. The rank of the arrays must be the same. However, the dimension(s) specified by dims may be of different sizes. (See Examples.) All other dimensions must match. At a minimum, the sample sizes should be greater than 100. Missing data are not allowed. It is the user's responsibility to remove missing values prior to calling the function.
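
Since missing values must be removed by the caller, the following is a minimal sketch of one way to do that for one-dimensional samples. The variable names x1d and y1d, their values, and the -999. fill value are hypothetical and chosen only for illustration; the samples are far smaller than the recommended size and serve only to show the mechanics.

```ncl
; Hypothetical 1D samples; -999. marks missing values
x1d = (/15.7, -999., 15.9, 16.2, 15.9, 16.0/)
y1d = (/15.4, 16.0, -999., 15.7, -999., 16.3/)
x1d@_FillValue = -999.
y1d@_FillValue = -999.

ix = ind(.not.ismissing(x1d))     ; indices of the valid x values
iy = ind(.not.ismissing(y1d))     ; indices of the valid y values

; ind returns a missing value when no element qualifies
if ((.not.any(ismissing(ix))) .and. (.not.any(ismissing(iy)))) then
  p = kolsm2_n(x1d(ix), y1d(iy), 0)   ; the two sample sizes may differ
  print(p)
end if
```

The two cleaned samples end up with different lengths, which is allowed along the dimension named by dims.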

dims

The dimension(s) of x and y on which to calculate the statistic. They must be consecutive and increasing.

The dimension sizes of x and y may be different but the rank must be the same.
If dims=-1, then the entire arrays will be used.

Return value

Probability that the distributions are the same. The return type will be double if either x or y is of type double, and float otherwise. In addition, the two ancillary statistics used to compute the returned probability are attached as attributes, dstat and zstat. For the kolsm2_n function, these are defined as


        dstat = max(abs(Fx-Fy))     ; Fx, Fy: empirical cumulative distribution functions of x and y

        zstat = sqrt((M*N)/(M+N))*dstat
    where
        M, N are the sample sizes of x and y.
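
As a cross-check, the two ancillary statistics can be computed by hand. The sketch below assumes the standard two-sample KS definition, in which dstat is the largest absolute difference between the empirical cumulative distribution functions (ECDFs) of the two samples; it is not taken from the kolsm2_n source code. It uses the data from Example 1, and the variable names are illustrative.

```ncl
; Data from Example 1 on this page
x = (/15.7, 16.1, 15.9, 16.2, 15.9, 16.0, 15.8, 16.1, 16.3, 16.5, 15.5/)
y = (/15.4, 16.0, 15.6, 15.7, 16.6, 16.3, 16.4, 16.8, 15.2, 16.9, 15.1/)
M = dimsizes(x)
N = dimsizes(y)

; Pool the values: the ECDF difference can only change at a data point
xy        = new(M+N, typeof(x))
xy(0:M-1) = x
xy(M:)    = y

dstat = 0.0
do i = 0, M+N-1
  Fx    = num(x .le. xy(i)) / tofloat(M)   ; ECDF of x evaluated at xy(i)
  Fy    = num(y .le. xy(i)) / tofloat(N)   ; ECDF of y evaluated at xy(i)
  dstat = max((/dstat, abs(Fx-Fy)/))       ; keep the largest difference so far
end do
zstat = sqrt((M*N)/tofloat(M+N)) * dstat

print("dstat=" + dstat + "  zstat=" + zstat)
```

For these data the largest ECDF difference is 3/11, so the sketch gives dstat of about 0.2727 and zstat of about 0.6396, consistent with the attribute values quoted in Example 1.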

Description

Note: a bug was found in which this routine doesn't work across multiple-dimensioned arrays. This will be fixed in V6.4.0.

You can work around this by using loops, which will be slower:

; Assume:
;   TS_2 is dimensioned nyears1 x nlat x nlon 
;   TS_3 is dimensioned nyears2 x nlat x nlon 
;
  dims = dimsizes(TS_2)
  nlat = dims(1)
  nlon = dims(2)
  ks   = new((/nlat,nlon/),typeof(TS_2))
  ds   = new((/nlat,nlon/),typeof(TS_2))
  zs   = new((/nlat,nlon/),typeof(TS_2))
  do ilat = 0, nlat-1
    do ilon = 0, nlon-1
      ks_single     = kolsm2_n(TS_2(:,ilat,ilon),TS_3(:,ilat,ilon),0)
      ks(ilat,ilon) = ks_single
      ds(ilat,ilon) = ks_single@dstat
      zs(ilat,ilon) = ks_single@zstat
    end do
  end do

The Kolmogorov-Smirnov (KS) two-sample test determines if two samples are from the same parent distribution. The KS test is non-parametric and distribution free; that is, it makes no assumption about the distribution of the data. The statistic compares the cumulative distributions of the two data samples. A large difference between the two cumulative sample distributions indicates that the data are not drawn from the same distribution.

From Wikipedia: "The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples."

The null hypothesis is that both groups were sampled from populations with identical distributions. Hence, if the returned p-value is small (e.g., < 0.05), the two data sets were likely sampled from populations with different distributions and the null hypothesis can be rejected.
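
The decision rule just described can be sketched as follows. The significance level alpha is a hypothetical choice made by the user and is not set by kolsm2_n; the data are those of Example 1.

```ncl
alpha = 0.05      ; hypothetical significance level chosen by the user

x = (/15.7, 16.1, 15.9, 16.2, 15.9, 16.0, 15.8, 16.1, 16.3, 16.5, 15.5/)
y = (/15.4, 16.0, 15.6, 15.7, 16.6, 16.3, 16.4, 16.8, 15.2, 16.9, 15.1/)

p = kolsm2_n(x, y, 0)

if (p .lt. alpha) then
  print("p=" + p + ": reject H0; the samples likely come from different distributions")
else
  print("p=" + p + ": cannot reject H0")
end if
```

Example 1 quotes p=0.808 for these data, so the second branch would be taken.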

The kolsm2_n function does not test for 'ties'. This test should only be used when ties are a very small percent of the entire samples.
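
Because the function does not check for ties, the user may want to gauge them beforehand. The following is a rough sketch that counts adjacent equal values in the pooled, sorted sample. Treating each such pair as a "tie" is an assumption made here for illustration, and what counts as "a very small percent" is left to the user's judgment. The data are those of Example 1.

```ncl
x = (/15.7, 16.1, 15.9, 16.2, 15.9, 16.0, 15.8, 16.1, 16.3, 16.5, 15.5/)
y = (/15.4, 16.0, 15.6, 15.7, 16.6, 16.3, 16.4, 16.8, 15.2, 16.9, 15.1/)
M = dimsizes(x)
N = dimsizes(y)

; Pool both samples into one array; x and y themselves are left unchanged
xy        = new(M+N, typeof(x))
xy(0:M-1) = x
xy(M:)    = y
qsort(xy)                                  ; sorts xy in place, ascending

nties = num(xy(1:) .eq. xy(0:M+N-2))       ; number of adjacent equal pairs
pct   = 100.0 * nties / (M+N)              ; relative to the pooled sample size

print("tied pairs: " + nties + " (" + pct + "% of the pooled sample)")
```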

This function sorts the x and y subsets before doing the calculation. As a result, processing large datasets will take time. The original input arrays are not changed.

NOTE: Whenever distributions are tested (here, two of them), the user should realize that there is no substitute for large sample sizes. A 'large' sample should contain at least 100 values.

Examples

See note above about bug with multiple dimensions.

As always, it is best if the sample sizes of x and y are large.

Example 1

Consider two small arrays.

   x = (/15.7, 16.1, 15.9, 16.2, 15.9, 16.0, 15.8, 16.1, 16.3, 16.5, 15.5/)
   y = (/15.4, 16.0, 15.6, 15.7, 16.6, 16.3, 16.4, 16.8, 15.2, 16.9, 15.1/)

   p = kolsm2_n(x,y,0)     ; p=0.808 ; p@dstat = 0.2727; p@zstat= 0.639
                           ; cannot reject null hypothesis (H0)
   print(p)

The output from the print statement is: