NCL Home> Application examples> Data Analysis || Data files for some examples

Example pages containing: tips | resources | functions/procedures

NCL: Regression & Trend

Regression:There are four primary regression functions:

(a) regline which performs simple linear regression; y(:)~r*x(:)+y0;
(b) regline_stats which performs linear regression and, additionally, returns confidence estimates and an ANOVA table.
(c) regCoef which performs simple linear regression on multi-dimensional arrays
(d) reg_multlin_stats which performs multiple linear regression (v6.2.0) and , additionally, returns confidence estimates and an ANOVA table.

There are two other regression functions (regcoef, reg_multlin) but these are used less frequently. They are maintained for backward compatibility only.

Linear and multiple linear regression models make a number of assumptions about the independent predictor variable(s) and the dependent response variable (predictand). A primary assumption is that the response variable is a linear combination of the regression coefficients and the predictor variables. Further, the predictor variable(s) are assumed to be uncorrelated.

Trend: In addition to regression, other methods can be used to assess trend. The well known Mann-Kendall non-parametric trend test statistically assesses if there is a monotonic upward or downward trend over some time period. A monotonic upward (downward) trend means that the variable consistently increases (decreases) through time, but the trend may or may not be linear. The MK test can be used in place of a parametric linear regression analysis, which can be used to test if the slope of the estimated linear regression line is different from zero. The regression analysis requires that the residuals from the fitted regression line be normally distributed; an assumption not required by the MK test, that is, the MK test is a non-parametric (distribution-free) test.

The Theil-Sen trend estimation method "is insensitive to outliers; it can be significantly more accurate than simple linear regression for skewed and heteroskedastic data, and competes well against non-robust least squares even for normally distributed data in terms of statistical power. It has been called 'the most popular nonparametric technique for estimating a linear trend'."

The trend_manken function performs both the Mann-Kendall test and Theil-Sen trend estimation. The USA's Environmental Agency provides a basic description of use and interpretation: Statistical Analysis for Monotonic Trends.

regress_1.ncl: Tabular data ( regress_1.txt ) contained within an ascii file are read. The regline function is used to calculate the least squared regression line for a one dimensional array. Here, the regression coefficient is synonymous with the slope or trend.


     Variable: rc

     Number Of Attributes: 7
       yintercept :	275.3205350166454
       yave :	269.5734479950696
       xave :	7618.59756097561
       nptxy :	82
       rstd :	0.000234703
       tval :	-3.21405918

     (0)	-0.000754349

A Python version of this projection is available here.

regress_1a.ncl: See regline_stats Example 1. This uses the information provided in the NCL 6.4.0 release to plot 95% mean response (blue), prediction limits (red) and lines derived using the 95% limits if the slope and y-intercept (green).


       Variable: rc
        Type: float
        Total Size: 4 bytes
                    1 values
        Number of Dimensions: 1
        Dimensions and sizes:   [1]
        Coordinates: 
        Number Of Attributes: 38
          _FillValue :  9.96921e+36

          long_name :   simple linear regression
          model :       Yest = b(0) + b(1)*X

          N :   18                              ; # of observations
          NP :  1                               ; # of predictor variables
          M :   2                               ; # of returned coefficients
          bstd :        (  0, 0.9947125 )       ; standardized regression coefficient
                                                ; [ignore 1st element]

                                                ; ANOVA information: SS=>Sum of Squares
          SST : 4823324                         ; Total SS:      sum((Y-Yavg)^2) 
          SSE : 50871.91                        ; Residual SS:   sum((Yest-Y)^2)
          SSR : 4772452                         ; sum((Yest-Yavg)^2)

          MST : 283724.9
          MSE : 3179.494
          MSE_dof :     16
          MSR : 4772452

          RSE : 56.387                          ; residual standard error; sqrt(MSE)
          RSE_dof :     15

          F :   1501.01                         ; MSR/MSE
          F_dof :       ( 1, 16 )
          F_pval :      1.51066e-17

          r2 :  0.989453            ; square of the Pearson correlation coefficient 
          r :   0.9947125           ; multiple (overall) correlation: sqrt(r2)
          r2a : 0.9887938           ; adjusted r2... better for small N

          fuv : 0.01054698          ; (1-r2): fraction of variance of the regressand
                                    ; (dependent variable) Y which cannot be explained
                                    ;  i.e., which is not correctly predicted, by the 
                                    ; explanatory variables X.

          Yest :        [ARRAY of 18 elements]  ; Yest = b(0) + b(1)*x 
          Yavg :        2125.278                ; avg(y)
          Ystd :        532.6583                ; stddev(y)
          Xavg :        2165                    ; avg(x)
          Xstd :        543.6721                ; stddev(x)

          stderr :      ( 56.05802, 0.02515461 )     ; std. error of each b
          tval :        ( 0.2738642, 38.74285 )      ; t-value
          pval :        ( 0.7874895, 5.017699e-18 )  ; p-value

                                                ; following added in version 6.4.0
          b95  :        ( 0.9212361, 1.027887 ) ;  2.5% and 97.5% regression coef. confidence intervals
          y95  :        ( -459.9983, 490.703 )  ;  2.5% and 97.5% y-intercept confidence intervals
          YMR025 :      [ARRAY of 18 elements]  ;  2.5% mean response
          YMR975 :      [ARRAY of 18 elements]  ; 97.5% mean response
          YPI025 :      [ARRAY of 18 elements]  ;  2.5% prediction interval
          YPI975 :      [ARRAY of 18 elements]  ; 97.5% prediction interval

                                                ; Original regline attributes 
                                                ; provided for backward cmpatibility
          nptxy : 18
          xave :  2165
          yave :  2125.278
          rstd :  56.387
          yintercept :    15.35228

          b :   ( 15.35228, 0.9745615 )         ; b(0) + b(1)*x

        (0)     0.9745615                       ; regression coefficient

regress_1b.ncl: See regline_stats Example 2. This is a commonly used WWW example taken from: http://www.stat.ucla.edu/~hqxu/stat105/pdf/ch11.pdf. This uses the information provided in the 6.4.0 release to plot 95% mean response and prediction estimates.

regress_1c.ncl: The reciprocal approach is mentioned at the end of http://www.stat.ucla.edu/~hqxu/stat105/pdf/ch11.pdf. R code for processing this is at http://people.stat.sc.edu/Tebbs/stat509/Rcode_CH11.htm.

The point of this example is that one should examine the residuals from the fitted regression line. An obvious feature is that the values at the extremes are lower than the fitted line while the bulk of the middle values are above the fitted line.

It is suggested the the reciprocal (x' = 1/x) be used as the independent variable. The yields

               y' = 2.9789 - 6.9345x'

The scatter about the fiited line is reasonably distributed suggesting that the reciprocal transformation is a better fit than the original model.

regress_2.ncl: Read sea level pressures (time,lat,lon) over the globe and use regCoef to calculate the regression coefficients (aka: slopes, trends). The time units attribute is "days since 1850-01-01 00:00:00". Hence, the trend units are Pa/day. These are changed to Pa/year for nicer plot.

regress_3.ncl: Read northern hemisphere near-surface temperatures spanning 1901-2012 from the 20th Century Reanalysis. Calculate the areal weighted annual mean temperature for each year. Use reg_multlin_stats to calculate the trend and other statistics. Note: regline could also have been used for the regression coefficient.

regress_4.ncl: Read northern hemisphere near-surface temperatures spanning 1951-2010 from the 20th Century Reanalysis. Calculate the areal weighted annual mean temperature for each year. Use regCoef to calculate the trends.

regress_5.ncl: Read data from a table and perform a multiple linear regression using reg_multlin_stats. There is one dependent variable [y] and 6 predictor variables [x]. Details of the "KENTUCKY.txt" data can be found at:

Davis, J.C. (2002): Statistics and Data Analysis in Geology 
Wiley (3rd Edition), pgs: 462-482

The output includes:


[snip]

(0)     ------- ANOVA information-------- 
(0)      
(0)     SST=2934.82  SSR=1800.7  SSE=1134.12
(0)     MST=59.8943  MSR=300.117  MSE=26.3748 RSE=5.13564
(0)     F-statistic=11.3789 dof=(6,43)
(0)               -------                 

  r2 :  0.613565               ; explains 61% of variance
  r :   0.783304
  r2a : 0.559644
  fuv : 0.386435 
[snip]

  stderr :  ( 11.27078, 0.0109590, 0.008168, 0.175949, 0.020587, 0.0021864, 0.072610 )
  tval :    ( -0.19914, 0.4645738, 2.763535,-1.323133, 3.041301,-0.9319575,-1.605605 )
  pval :    (  0.84307, 0.6445273, 0.008318, 0.192626, 0.003961, 0.3564441, 0.115515 )

(0)     -2.24446               ; y-intercept
(1)      0.0050913             ; partial regression coefficients
(2)      0.0225729
(3)     -0.2328043
(4)      0.0626111
(5)     -0.0020377
(6)     -0.1165836

regress_6.ncl: This is based upon a script created by Aixhu Hu (NCAR).

 
     (a) Read two text files via asciiread; 
     (b) Perform a linear regression using regline; 
     (c) Create a 3rd degree polynomial via lspoly_n ;
     (d) Plot markers and the polynomial and regression lines.

Data files: ipoind_8yrTrend.asc and tsind_8yrTrend.asc

The output includes:

    Variable: coef
    Type: float
    Total Size: 16 bytes
                4 values
    Number of Dimensions: 1
    Dimensions and sizes:   [4]
    Coordinates: 
    Number Of Attributes: 1
      _FillValue :  9.96921e+36
    
    (0)     -0.743525
    (1)     180.7366
    (2)     125.8308
    (3)     2630.312  
    (0)     ======
    
    
    Variable: rc
    Type: float
    Total Size: 4 bytes
                1 values
    Number of Dimensions: 1
    Dimensions and sizes:   [1]
    Coordinates: 
    Number Of Attributes: 7
      _FillValue :  9.96921e+36
      yintercept :  -0.08270691
      yave :        -0.14029
      xave :        -0.0002547655
      nptxy :       1175
      rstd :        11.93106
      tval :        18.94417
    (0)     226.024
    (0)     ======

regress_7.ncl: The "Law School Data Sets" (82 school sample and 15-school sample) are used by several tools (eg., Matlab and R) for assorted examples. They are shown here for background purposes because they are used for some NCL examples also. The data sets are here: law_school_15.txt and law_school_82.txt

Reference: An Introduction to the Bootstrap
           B. Efron and R.J. Tibshirani, Chapman and Hall (1993)

regress_8.ncl: The Old Faithful Geyser Data is used by R and other tools. It contains two variables: (a) the waiting time between eruptions, and (b) the durations of the eruptions for the Old Faithful Geyser. Here, a bivariate linear regression is performed using regline and regline_stats. A comparison to the output from R is shown in the examples associated with these functions.

regress_qreg_R.ncl: This example, contributed by Laurent Terray [CERFACS; 15 Aug 2018], illustrates how to invoke an R-function that performs Quantile Regression and Ordinary Least Squares from within an NCL script. Because it is not possible to directly pass variables to/from NCL and R, the script uses files as the method of communication. The files could be any format that NCL and R can handle (eg: text, binary, netCDF, ...). This example uses text (csv) files. Specifically, the NCL driver script:

(a) reads a netCDF file;
(b) extracts a user specified region of interest using coordinate subscripting syntax: {...};
(c) calculates time series of areal means using wgt_areaave_Wrap>;
(d) creates a csv text file containing the time series of areal means;
(e) uses system to invoke the R driver script: QuantReg.R;
(e1) R function: rq does the quantile regression;
(e2) R function: ols performs the ordinary least squares fit;
(f) reads the csv files created by the R driver script;
(g) use NCL graphics to plot the returned information.

If the R-package "quantreg" is not locally available, it must be installed. Start R:

      > install.packages("quantreg")
      > q()

manken_1.ncl: Read an ascii file (TA_Globe.1850-2014.txt; source (Met Office Hadley Centre) and extract pertinent columns. Compute the simple linear regression (regline_stats; red) and Mann-Kendall/Theil-Sen (trend_manken; blue) trend estimates over three periods. Here, there is virtually no difference between the derived values. Both 'lines' are plotted but only the values from the Mann-Kendall/Theil-Sen are shown. If there are outliers or skewed data, the Theil-Sen is likely to give the better linear trend estimate.

manken_2.ncl: Read an ascii file (rutgers.snow_cover_extent.time_series.txt; source ( Rutgers Univ Global Snow Lab) and extract pertinent columns. Compute the linear regression and Theil-Sen trend estimates for winter (Nov-Dec-Jan-Feb-Mar) and summer (May-Jun-Jul-Aug-Sep) over the satellite era (1979-2014). Both simple linear regression (regline_stats; red) and Mann-Kendall/Theil-Sen (trend_manken; blue) estimates are shown. The summer trend lines illustrate a small difference between the trend estimates. It is likely that the distribution of residuals about the linear regression line is slightly skewed. Hence, the Theil-Sen is likely better.