Re: Find and aggregate duplicates

From: Dennis Shea <shea_at_nyahnyahspammersnyahnyah>
Date: Tue, 17 Mar 2009 10:51:11 -0600

A sample function and test driver is attached.
Quite possibly, others could write more efficient code.

Wall Clock Timings:
8000 elements: less than 1 sec
40000 elements: 6 seconds
80000 elements: 25 sec

Good luck

>
> Hello,
>
> We have two huge arrays with the same dimension:
>
> Days = (/19950308,19950314,19950314,...,20081228,20081231,20081231/)
>
> Data = (/ 12.1, 22.5, 32.0, 12.8, 16.0, 32.1/)
>
> There can be several data values for the same day.
>
> Days are irregular. Sure we can sort them, however there will be gaps
> and duplicates in this array.
>
> Is it possible to carry out the following tasks efficiently, no loops,
> using only built-in NCL functions:
>
> 1) Construct the list of unique days, no duplicates
>
> 2) Calculate amount of duplicates for each unique day
>
> 3) Calculate the data average for each unique day
>
> I see no such solution, only loops.
>
> Am I right?
>
> Thanks,
>
> Slava
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> ncl-talk mailing list
> List instructions, subscriber options, unsubscribe:
> http://mailman.ucar.edu/mailman/listinfo/ncl-talk
>

```--
======================================================
Dennis J. Shea                  tel: 303-497-1361    |
P.O. Box 3000                   fax: 303-497-1333    |
Climate Analysis Section                             |
Climate & Global Dynamics Div.                       |
National Center for Atmospheric Research             |
Boulder, CO  80307                                   |
USA                        email: shea 'at' ucar.edu |
======================================================

```

function uniqueDays(Days[*]:integer, Data[*]:numeric)

; There can be several data values for the same day.
; Days are irregular. Sure we can sort them, however there will
; be gaps and duplicates in this array.
;
; Is it possible to carry out the following tasks efficiently,
; no loops, using only built-in NCL functions:
;
; 1) Construct the list of unique days, no duplicates
; 2) Calculate amount of duplicates for each unique day
; 3) Calculate the data average for each unique day

local nDays, inum, uFlg, x, n, k, ii
begin
nDays = dimsizes( Days )

inum = new(nDays, integer)
uFlg = new(nDays, logical)
uFlg = True

if (isatt(Data, "_FillValue")) then
x = new( (/nDays,3/), "float", getFillValue(Data))
else
x = new( (/nDays,3/), "float", 1e20)
end if

do n=0,nDays-1
inum(n) = num(Days(n).eq.Days)
end do

if (all(inum.eq.1)) then
x(:,0) = Days ; unique days
x(:,1) = 1 ; number of duplicates
x(:,2) = Data ; average [only one value]
k = nDays-1 ; max subscript
else
k = -1
do n=0,nDays-1
if (inum(n).eq.1) then
k = k+1
x(k,0) = Days(n)
x(k,1) = 1
x(k,2) = Data(n)
uFlg(n)= False
else
if (uFlg(n)) then
k = k+1
x(k,0) = Days(n)
x(k,1) = inum(n)
ii = ind(Days(n).eq.Days)
x(k,2) = avg(Data(ii))
uFlg(ii) = False ; turn off all duplicate dates
delete(ii) ; may change next iteration
end if
end if
end do
end if

if (isatt(Data,"long_name")) then
x_at_long_name = Data_at_long_name
end if

x_at_description = "(0) Date; (1) # Duplicates; (2) Avg"
x!0 = "time"
x!1 = "info"

return( x(0:k,:) )
end

begin
;=======================================================
; MAIN
;=======================================================

Days = (/19950308,19950314,19950314,20081228,20081231,20081231, 20081231,20081231/)
Data = (/ 12.1, 22.5, 32.0, 12.8, 16.0, 32.1, -17.25, 33.9 /)
nDays = dimsizes(Days)

cr = inttochar(10)
print(cr+"=======> "+nDays+" <======"+cr)

dayStat = uniqueDays(Days, Data)
printVarSummary(dayStat)

uDays = floattoint( dayStat(:,0 ) )
uDups = floattoint( dayStat(:,1 ) )
uAvg = dayStat(:,2 )
print(uDays+" "+uDups+" "+uAvg )

; replicate 1000 time
print(cr+"=======> "+nDays*1000+" <======"+cr)
DAYS = ndtooned( conform_dims( (/1000,nDays/), Days, 1))
DATA = ndtooned( conform_dims( (/1000,nDays/), Data, 1))

wcStrt = systemfunc("date")
DAYStat = uniqueDays(DAYS, DATA)
wallClockElapseTime(wcStrt, "size="+(nDays*1000), 0)
printVarSummary(DAYStat)

uDAYS = floattoint( DAYStat(:,0 ) )
uDUPS = floattoint( DAYStat(:,1 ) )
uAVG = DAYStat(:,2 )
print(uDAYS+" "+uDUPS+" "+uAVG )

delete(DAYS)
delete(DATA)

; replicate 5000 time
print(cr+"=======> "+nDays*5000+" <======"+cr)
DAYS = ndtooned( conform_dims( (/5000,nDays/), Days, 1))
DATA = ndtooned( conform_dims( (/5000,nDays/), Data, 1))

wcStrt = systemfunc("date")
DAYStat = uniqueDays(DAYS, DATA)
wallClockElapseTime(wcStrt, "size="+(nDays*5000), 0)
printVarSummary(DAYStat)

uDAYS = floattoint( DAYStat(:,0 ) )
uDUPS = floattoint( DAYStat(:,1 ) )
uAVG = DAYStat(:,2 )
print(uDAYS+" "+uDUPS+" "+uAVG )

delete(DAYS)
delete(DATA)

; replicate 10000 time
print(cr+"=======> "+nDays*10000+" <======"+cr)
DAYS = ndtooned( conform_dims( (/10000,nDays/), Days, 1))
DATA = ndtooned( conform_dims( (/10000,nDays/), Data, 1))

wcStrt = systemfunc("date")
DAYStat = uniqueDays(DAYS, DATA)
wallClockElapseTime(wcStrt, "size="+(nDays*10000), 0)
printVarSummary(DAYStat)

uDAYS = floattoint( DAYStat(:,0 ) )
uDUPS = floattoint( DAYStat(:,1 ) )
uAVG = DAYStat(:,2 )
print(uDAYS+" "+uDUPS+" "+uAVG )
end

_______________________________________________
ncl-talk mailing list
List instructions, subscriber options, unsubscribe:
http://mailman.ucar.edu/mailman/listinfo/ncl-talk
Received on Tue Mar 17 2009 - 10:51:11 MDT

This archive was generated by hypermail 2.2.0 : Wed Mar 18 2009 - 14:50:21 MDT