“apply” returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
(m = matrix(1:6, nrow = 2))
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
apply(m, 1, sum)
## [1] 9 12
“lapply” returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
lapply(1:5, sqrt)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414
##
## [[3]]
## [1] 1.732
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 2.236
“sapply” is a user-friendly version of lapply by default returning a vector or matrix if appropriate.
sapply(1:5, sqrt)
## [1] 1.000 1.414 1.732 2.000 2.236
“vapply” is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
vapply(1:5, sqrt, 1i)
## [1] 1.000+0i 1.414+0i 1.732+0i 2.000+0i 2.236+0i
“rapply” is a recursive version of lapply.
(x = list(A = list(a=pi, b=list(b1=1)), B = "a character string"))
## $A
## $A$a
## [1] 3.142
##
## $A$b
## $A$b$b1
## [1] 1
##
##
##
## $B
## [1] "a character string"
rapply(x, sqrt, classes = "numeric", how = "unlist")
## A.a A.b.b1
## 1.772 1.000
rapply(x, sqrt, classes = "numeric", how = "replace")
## $A
## $A$a
## [1] 1.772
##
## $A$b
## $A$b$b1
## [1] 1
##
##
##
## $B
## [1] "a character string"
rapply(x, sqrt, classes = "numeric", how = "list")
## $A
## $A$a
## [1] 1.772
##
## $A$b
## $A$b$b1
## [1] 1
##
##
##
## $B
## NULL
“mapply” is a multivariate version of sapply. mapply applies FUN to the first elements of each … argument, the second elements, the third elements, and so on. Arguments are recycled if necessary.
mapply(rep, LETTERS[1:3], 1:3)
## $A
## [1] "A"
##
## $B
## [1] "B" "B"
##
## $C
## [1] "C" "C" "C"
See also lapply, sapply, vapply, mapply, rapply and R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate.
Decompose the pollution series of Beijing into trend, seasonal, and random noise components. Does its air pollution go worse all the time?
load("air-quality.RData")
head(df)
## City Date Pollution
## 1 Beijing 6/5/2000 116
## 2 Tianjin 6/5/2000 81
## 3 Shijiazhuang 6/5/2000 127
## 4 Taiyuan 6/5/2000 155
## 5 Hohhot 6/5/2000 98
## 6 Shenyang 6/5/2000 93
tail(df)
## City Date Pollution
## 378381 Yibin 11/14/2013 49
## 378382 Zunyi 11/14/2013 53
## 378383 Tongchuan 11/14/2013 84
## 378384 Xianyang 11/14/2013 66
## 378385 Yan'an 11/14/2013 81
## 378386 Jinchang 11/14/2013 485
bj <- subset(df,City=="Beijing",select=c(Date,Pollution))
day <- as.POSIXlt(x = "2000-6-5", origin="2000-1-1")$yday
bjts <- ts(bj$Pollution,start=c(2000,day),freq=365.25)
plot(decompose(bjts))
The air quality of Beijing doesn’t go worse all the time. In fact, as you can see from the trend component of it’s pollution time series, the air quality of Beijing keeps improving all the time.
Which province has the worst air pollution? Which has the best air quality? (Build your own standard and use data to support your argument)
Here I provided a simple standard to sort the air quality of all these cities. My sandard is the average air pollution index of the entire sample period for all the cities. One thing that I need to remind you is the sample periods for the cities are quite different. So it seems impossible to calculate the average air pollution index for all the cities on the same time domain.
meanPol <- aggregate(df[,3], by=list(city=df$City), FUN = mean)
sortPol <- meanPol[order(meanPol$x, decreasing = T),]
colnames(sortPol) <- c("City", "Pollution")
head(sortPol)
## City Pollution
## 46 Lanzhou 114.57
## 87 Urumqi 104.25
## 78 Shijiazhuang 97.34
## 82 Taiyuan 96.58
## 7 Beijing 96.41
## 95 Xi'an 91.96
tail(sortPol)
## City Pollution
## 112 Zhanjiang 46.99
## 115 Zhongshan 46.38
## 6 Beihai 44.05
## 116 Zhuhai 43.73
## 27 Haikou 36.72
## 71 Sanya 22.90
sc <- sortPol$City
We find that Lanzhou has the worst airquality, while Sanya has the best air quality.
Decompose the time series for each city into stochastic trend, seasonal, and random noise components. From the trend data, what is the ratio of cities enjoying better air qualities than 10 years ago? What about those suffering from worse air quality than 10 years ago?
If the stochastic trend component of the pollution series has a positive time trend, then I say it enjoys better air qualities than 10 years ago. Otherwise, it suffers from worse air quality than 10 years ago.
library(plyr)
library(doParallel)
cl <- makeCluster(detectCores()) #Parallel computing
clusterCall(cl, function() {
fr <<- 365.25 # On average every year has 365.25 days
})
## [[1]]
## [1] 365.2
##
## [[2]]
## [1] 365.2
registerDoParallel(cl)
plut.coe <- ddply(.data = df,.variables = "City",.parallel = T, .fun = function(dt){
dts <- ts(dt$Pollution, freq = fr)
if(length(dts) >= 2*fr){
trend <- na.omit(decompose(dts)$trend)
coe <- lm(trend~time(trend))$coefficients[2]
}else{
coe <- NA
}
return(coe)
})
# I remove the cities that have observations less than 2 years.
stopCluster(cl)
colnames(plut.coe)<-c("City", "coefficient")
str(plut.coe)
## 'data.frame': 120 obs. of 2 variables:
## $ City : Factor w/ 120 levels "Anshan","Anyang",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ coefficient: num -2.74 8.193 NA -1.762 0.992 ...
na.city <- as.character(plut.coe[is.na(plut.coe$coefficient),1])
sort.coe <- plut.coe[order(plut.coe$coefficient, na.last = NA, decreasing = T),]
# the cities that I ignored
na.city
## [1] "Baoding" "Changzhou" "Handan" "Tangshan" "Wuxi" "Xuzhou"
head(sort.coe);tail(sort.coe)
## City coefficient
## 105 Yibin 9.462
## 2 Anyang 8.193
## 38 Jinchang 6.078
## 51 Luoyang 5.128
## 106 Yichang 4.860
## 34 Jiaozuo 4.819
## City coefficient
## 46 Lanzhou -4.961
## 78 Shijiazhuang -5.501
## 82 Taiyuan -5.855
## 86 Tongchuan -6.696
## 19 Datong -6.761
## 101 Yan'an -7.663
better <- subset(sort.coe, coefficient<0, select=c(City, coefficient))
# the cities that enjoy better air quality
print(as.character(better$City), quote=F, justify = c("left"))
## [1] Yuxi Wuhu Zhuhai
## [4] Zhenjiang Jilin Sanya
## [7] Liuzhou Jinzhou, Liaoning Kunming
## [10] Nanning Yantai Yangzhou
## [13] Mianyang Changde Guiyang
## [16] Quanzhou Fuzhou, Fujian Changchun
## [19] Chengdu Shenzhen Qingdao
## [22] Zigong Dalian Zhanjiang
## [25] Zhengzhou Yueyang Lianyungang
## [28] Qinhuangdao Guangzhou Nantong
## [31] Panzhihua Zhuzhou Kaifeng
## [34] Nanchang Shaoguan Shanghai
## [37] Suzhou, Jiangsu Qiqihar Lhasa
## [40] Mudanjiang Hangzhou Baoji
## [43] Jinan Harbin Urumqi
## [46] Zhangjiajie Nanjing Yinchuan
## [49] Xi'an Deyang Qujing
## [52] Shizuishan Xining Wuhan
## [55] Jiujiang Fushun Anshan
## [58] Beijing Chifeng Chongqing
## [61] Tianjin Luzhou Changzhi
## [64] Daqing Shenyang Hohhot
## [67] Weinan Benxi Pingdingshan
## [70] Changsha Yangquan Lanzhou
## [73] Shijiazhuang Taiyuan Tongchuan
## [76] Datong Yan'an
worse <- subset(sort.coe, coefficient>0, select=c(City, coefficient))
# the cities that suffer from worse air quality
print(as.character(worse$City), quote=F, justify = c("left"))
## [1] Yibin Anyang Jinchang
## [4] Luoyang Yichang Jiaozuo
## [7] Ma'anshan Linfen Xiangtan
## [10] Sanmenxia Rizhao Foshan
## [13] Xianyang Tai'an Zhongshan
## [16] Guilin Jining Zaozhuang
## [19] Taizhou, Zhejiang Shaoxing Weihai
## [22] Jingzhou Ningbo Weifang
## [25] Baotou Zibo Huzhou
## [28] Zunyi Beihai Nanchong
## [31] Karamay Wenzhou Jiaxing
## [34] Haikou Hefei Shantou
## [37] Xiamen
Since it will report error that “time series has no or less than 2 periods”. It counts year as period. So I have to remove the cities that have observations less than 2 years.
There are 77 cities enjoy better air quality than before, while 37 cities suffer from worse air quality.
Is the stochastic trend of the air pollution predictable? (Use data and model to support your argument)
Taking Beijing as an example, we get the stochastic trend of its pollution series first.
# the trend series of Beijing with NAs
bj_trend <- na.omit(decompose(bjts)$trend)
plot(bj_trend, main="The stochastic trend of Beijing air pollution series")
We can see from the graph that the stochastic trend is not stationary. This could be confirmed by run the following tests.
library(tseries)
ur_test <- function(ts){
cat("The p-value of Augmented Dickey???Fuller Test is", adf.test(ts)$p.value,"\n")
cat("The p-value of Phillips???Perron Unit Root Test is", pp.test(ts)$p.value,"\n")
cat("The p-value of KPSS Test for Stationarity is", kpss.test(ts)$p.value)
}
cat("The unit root tests on the stochastic trend of Beijing air pollution series:")
## The unit root tests on the stochastic trend of Beijing air pollution series:
ur_test(bj_trend)
## The p-value of Augmented Dickey???Fuller Test is 0.4499
## The p-value of Phillips???Perron Unit Root Test is 0.6876
## The p-value of KPSS Test for Stationarity is 0.01
So the stochastic trend is not stationary. Next I plot the ACF and PACF of this stochastic trend.
require(astsa)
x <- acf2(bj_trend, max.lag = 50)
You can judge from the graph that the stochastic trend of Beijing air pollution series seems like I(1). We then fit it to AR(1)
(bj_fit <- arima(bj_trend, order=c(1,0,0)))
##
## Call:
## arima(x = bj_trend, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 1 96.74
## s.e. 0 14.79
##
## sigma^2 estimated as 0.0415: log likelihood = 727.5, aic = -1449
As you can see, the coefficient (0.9999) is quite close to one. So the stochastic trend of Beijing air pollution series is random walk. So we can conclude that the stochastic trend is not predictable.
Are there any city whose stochastic trend of air pollution is comoving with that of Beijing? (hint: use cointegration tests to find out)
First of all, I find the overlapping period of each city with Beijing. Then I get two time series with the common and continuous date. I decompose the two series to get two stochastic trends and I test they are cointegrated. I have to remove some cities for two reasons: it has observations less than two periods(years), or its overlaping period with Beijing is less than two periods(years).
# cointegration test function
cointest<-function(regdata){
fitxy<-lm(y~x+0,data = regdata ,na.action = na.omit)
pvalue <- (adf.test(fitxy$residuals))$p.value
ifelse(pvalue<=0.05, 1, 0)
}
# function to get the continuous date observations
continu <- function(xx){
xx[,1] <- as.Date(xx[,1],format="%m/%d/%Y")
srt <- xx[order(xx$Date, decreasing = F),]
n<- dim(srt)[1]-1
for(i in 1:n){
diff.day <- as.numeric(srt$Date[i+1]-srt$Date[i])
if(diff.day>1){
srt[i,] <- NA
}else{
next
}
}
srt <- na.omit(srt)
return(srt)
}
# function to generate a data frame with two time series
getts <- function(ctn){
dst = ctn[1,1]
year <- as.numeric(format(dst,"%Y"))
ori<-paste(year,"-1-1",sep="")
day <- as.POSIXlt(dst, format="%m/%d/%Y",origin=ori)$yday
y.ts<-ts(ctn[,2], start=c(year,day), freq=fr)
x.ts<-ts(ctn[,3], start=c(year,day), freq=fr)
dts <- data.frame(y=y.ts, x=x.ts)
return(dts)
}
# workhorse function
test <- function(dt){
xx <- merge(bj, dt, by="Date")
TF <- nrow(xx)>2*fr
if(TF){
ctn <- continu(xx)
dts <- getts(ctn)
TF <- !is.null(nrow(dts)) & nrow(dts) > 2*fr
if(TF){
x <- na.omit(decompose(dts$x)$trend)
y <- na.omit(decompose(dts$y)$trend)
inter <- na.omit(ts.union(y, x,dframe=TRUE))
TF <- !is.null(nrow(inter)) & nrow(inter) > 20
if(TF){
regdata <- inter
}else{
regdata <- fake
}
TF <- !is.null(nrow(regdata)) & nrow(regdata) > 20
result <- ifelse(TF, cointest(regdata), NA)
}else{
result <- NA
}
}else{
result <- NA
}
result
}
Then I run the test. In the test outcomes, 1
is for Cointegration
, 0
is for Not Cointegration
, NA
is for Not Available
,
fr <- 365.25
df5 <- df[df["City"]!="Beijing",]
fake <- matrix(1:4,c(2,2))
# Parallel computing
cl <- makeCluster(detectCores())
clusterCall(cl, function() {
library(tseries)
})
## [[1]]
## [1] "tseries" "methods" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "base"
##
## [[2]]
## [1] "tseries" "methods" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "base"
clusterExport(cl,c("fr","bj","continu","test","getts","cointest"))
registerDoParallel(cl)
test.res <- ddply(.data = df5,.variables = "City",.parallel = T, .fun = function(dt){
test(dt)
})
stopCluster(cl)
colnames(test.res)<-c("City", "Coint")
head(test.res)
## City Coint
## 1 Anshan 0
## 2 Anyang NA
## 3 Baoding NA
## 4 Baoji 0
## 5 Baotou NA
## 6 Beihai 0
(resNA <- sum(is.na(test.res["Coint"])))
## [1] 34
test.res <- na.omit(test.res)
(res0 <- sum(test.res["Coint"]==0))
## [1] 85
(res1 <- sum(test.res["Coint"]==1))
## [1] 0
So we can see that 34 cities are excluded from the tests for not enough observations, 85 cities are not comoving with Beijing and 0 city is comoving with Beijing. So the answer to this question is NO
.
WISE R Club project is proudly maintained by XiaojunSun.