R Project 1 Solution

Quiz
- apply
- lapply
- sapply
- vapply
- rapply
- mapply
Project 1

Quiz

apply

“apply” returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.

(m = matrix(1:6, nrow = 2))

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

apply(m, 1, sum)

## [1]  9 12

lapply

“lapply” returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

lapply(1:5, sqrt)

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414
## 
## [[3]]
## [1] 1.732
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236

sapply

“sapply” is a user-friendly version of lapply by default returning a vector or matrix if appropriate.

sapply(1:5, sqrt)

## [1] 1.000 1.414 1.732 2.000 2.236

vapply

“vapply” is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.

vapply(1:5, sqrt, 1i)

## [1] 1.000+0i 1.414+0i 1.732+0i 2.000+0i 2.236+0i

rapply

“rapply” is a recursive version of lapply.

(x = list(A = list(a=pi, b=list(b1=1)), B = "a character string"))

## $A
## $A$a
## [1] 3.142
## 
## $A$b
## $A$b$b1
## [1] 1
## 
## 
## 
## $B
## [1] "a character string"

rapply(x, sqrt, classes = "numeric", how = "unlist")

##    A.a A.b.b1 
##  1.772  1.000

rapply(x, sqrt, classes = "numeric", how = "replace")

## $A
## $A$a
## [1] 1.772
## 
## $A$b
## $A$b$b1
## [1] 1
## 
## 
## 
## $B
## [1] "a character string"

rapply(x, sqrt, classes = "numeric", how = "list")

## $A
## $A$a
## [1] 1.772
## 
## $A$b
## $A$b$b1
## [1] 1
## 
## 
## 
## $B
## NULL

mapply

“mapply” is a multivariate version of sapply. mapply applies FUN to the first elements of each … argument, the second elements, the third elements, and so on. Arguments are recycled if necessary.

mapply(rep, LETTERS[1:3], 1:3)

## $A
## [1] "A"
## 
## $B
## [1] "B" "B"
## 
## $C
## [1] "C" "C" "C"

Project 1

Question 1

Decompose the pollution series of Beijing into trend, seasonal, and random noise components. Does its air pollution go worse all the time?

load("air-quality.RData")
head(df)

##           City     Date Pollution
## 1      Beijing 6/5/2000       116
## 2      Tianjin 6/5/2000        81
## 3 Shijiazhuang 6/5/2000       127
## 4      Taiyuan 6/5/2000       155
## 5       Hohhot 6/5/2000        98
## 6     Shenyang 6/5/2000        93

tail(df)

##             City       Date Pollution
## 378381     Yibin 11/14/2013        49
## 378382     Zunyi 11/14/2013        53
## 378383 Tongchuan 11/14/2013        84
## 378384  Xianyang 11/14/2013        66
## 378385    Yan'an 11/14/2013        81
## 378386  Jinchang 11/14/2013       485

bj <- subset(df,City=="Beijing",select=c(Date,Pollution))
day <- as.POSIXlt(x = "2000-6-5", origin="2000-1-1")$yday
bjts <- ts(bj$Pollution,start=c(2000,day),freq=365.25)
plot(decompose(bjts))

The decomposition of Beijing air pollution time series

The air quality of Beijing doesn’t go worse all the time. In fact, as you can see from the trend component of it’s pollution time series, the air quality of Beijing keeps improving all the time.

Question 2

Which province has the worst air pollution? Which has the best air quality? (Build your own standard and use data to support your argument)

Here I provided a simple standard to sort the air quality of all these cities. My sandard is the average air pollution index of the entire sample period for all the cities. One thing that I need to remind you is the sample periods for the cities are quite different. So it seems impossible to calculate the average air pollution index for all the cities on the same time domain.

meanPol <- aggregate(df[,3], by=list(city=df$City), FUN = mean)
sortPol <- meanPol[order(meanPol$x, decreasing = T),]
colnames(sortPol) <- c("City", "Pollution")
head(sortPol)

##            City Pollution
## 46      Lanzhou    114.57
## 87       Urumqi    104.25
## 78 Shijiazhuang     97.34
## 82      Taiyuan     96.58
## 7       Beijing     96.41
## 95        Xi'an     91.96

tail(sortPol)

##          City Pollution
## 112 Zhanjiang     46.99
## 115 Zhongshan     46.38
## 6      Beihai     44.05
## 116    Zhuhai     43.73
## 27     Haikou     36.72
## 71      Sanya     22.90

sc <- sortPol$City

We find that Lanzhou has the worst airquality, while Sanya has the best air quality.

Question 3

Decompose the time series for each city into stochastic trend, seasonal, and random noise components. From the trend data, what is the ratio of cities enjoying better air qualities than 10 years ago? What about those suffering from worse air quality than 10 years ago?

If the stochastic trend component of the pollution series has a positive time trend, then I say it enjoys better air qualities than 10 years ago. Otherwise, it suffers from worse air quality than 10 years ago.

library(plyr)
library(doParallel)
cl <- makeCluster(detectCores()) #Parallel computing
clusterCall(cl, function() {
    fr <<- 365.25  # On average every year has 365.25 days
})

## [[1]]
## [1] 365.2
## 
## [[2]]
## [1] 365.2

registerDoParallel(cl)
plut.coe <- ddply(.data = df,.variables = "City",.parallel = T, .fun = function(dt){
    dts <- ts(dt$Pollution, freq = fr)
    if(length(dts) >= 2*fr){
        trend <- na.omit(decompose(dts)$trend)
        coe <- lm(trend~time(trend))$coefficients[2]
    }else{
        coe <- NA 
    }
    return(coe)
})
# I remove the cities that have observations less than 2 years.
stopCluster(cl)
colnames(plut.coe)<-c("City", "coefficient")
str(plut.coe)

## 'data.frame':    120 obs. of  2 variables:
##  $ City       : Factor w/ 120 levels "Anshan","Anyang",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ coefficient: num  -2.74 8.193 NA -1.762 0.992 ...

na.city <- as.character(plut.coe[is.na(plut.coe$coefficient),1])
sort.coe <- plut.coe[order(plut.coe$coefficient, na.last = NA, decreasing = T),]
# the cities that I ignored
na.city

## [1] "Baoding"   "Changzhou" "Handan"    "Tangshan"  "Wuxi"      "Xuzhou"

head(sort.coe);tail(sort.coe)

##         City coefficient
## 105    Yibin       9.462
## 2     Anyang       8.193
## 38  Jinchang       6.078
## 51   Luoyang       5.128
## 106  Yichang       4.860
## 34   Jiaozuo       4.819

##             City coefficient
## 46       Lanzhou      -4.961
## 78  Shijiazhuang      -5.501
## 82       Taiyuan      -5.855
## 86     Tongchuan      -6.696
## 19        Datong      -6.761
## 101       Yan'an      -7.663

better <- subset(sort.coe, coefficient<0, select=c(City, coefficient))
# the cities that enjoy better air quality
print(as.character(better$City), quote=F, justify = c("left"))

##  [1] Yuxi              Wuhu              Zhuhai           
##  [4] Zhenjiang         Jilin             Sanya            
##  [7] Liuzhou           Jinzhou, Liaoning Kunming          
## [10] Nanning           Yantai            Yangzhou         
## [13] Mianyang          Changde           Guiyang          
## [16] Quanzhou          Fuzhou, Fujian    Changchun        
## [19] Chengdu           Shenzhen          Qingdao          
## [22] Zigong            Dalian            Zhanjiang        
## [25] Zhengzhou         Yueyang           Lianyungang      
## [28] Qinhuangdao       Guangzhou         Nantong          
## [31] Panzhihua         Zhuzhou           Kaifeng          
## [34] Nanchang          Shaoguan          Shanghai         
## [37] Suzhou, Jiangsu   Qiqihar           Lhasa            
## [40] Mudanjiang        Hangzhou          Baoji            
## [43] Jinan             Harbin            Urumqi           
## [46] Zhangjiajie       Nanjing           Yinchuan         
## [49] Xi'an             Deyang            Qujing           
## [52] Shizuishan        Xining            Wuhan            
## [55] Jiujiang          Fushun            Anshan           
## [58] Beijing           Chifeng           Chongqing        
## [61] Tianjin           Luzhou            Changzhi         
## [64] Daqing            Shenyang          Hohhot           
## [67] Weinan            Benxi             Pingdingshan     
## [70] Changsha          Yangquan          Lanzhou          
## [73] Shijiazhuang      Taiyuan           Tongchuan        
## [76] Datong            Yan'an

worse <- subset(sort.coe, coefficient>0, select=c(City, coefficient))
# the cities that suffer from worse air quality
print(as.character(worse$City), quote=F, justify = c("left"))

##  [1] Yibin             Anyang            Jinchang         
##  [4] Luoyang           Yichang           Jiaozuo          
##  [7] Ma'anshan         Linfen            Xiangtan         
## [10] Sanmenxia         Rizhao            Foshan           
## [13] Xianyang          Tai'an            Zhongshan        
## [16] Guilin            Jining            Zaozhuang        
## [19] Taizhou, Zhejiang Shaoxing          Weihai           
## [22] Jingzhou          Ningbo            Weifang          
## [25] Baotou            Zibo              Huzhou           
## [28] Zunyi             Beihai            Nanchong         
## [31] Karamay           Wenzhou           Jiaxing          
## [34] Haikou            Hefei             Shantou          
## [37] Xiamen

Since it will report error that “time series has no or less than 2 periods”. It counts year as period. So I have to remove the cities that have observations less than 2 years.

There are 77 cities enjoy better air quality than before, while 37 cities suffer from worse air quality.

Question 4

Is the stochastic trend of the air pollution predictable? (Use data and model to support your argument)

Taking Beijing as an example, we get the stochastic trend of its pollution series first.

# the trend series of Beijing with NAs
bj_trend <- na.omit(decompose(bjts)$trend) 
plot(bj_trend, main="The stochastic trend of Beijing air pollution series")

plot of chunk trend_plot

We can see from the graph that the stochastic trend is not stationary. This could be confirmed by run the following tests.

library(tseries)
ur_test <- function(ts){
    cat("The p-value of Augmented Dickey???Fuller Test is", adf.test(ts)$p.value,"\n")
    cat("The p-value of Phillips???Perron Unit Root Test is", pp.test(ts)$p.value,"\n")
    cat("The p-value of KPSS Test for Stationarity is", kpss.test(ts)$p.value)
}

cat("The unit root tests on the stochastic trend of Beijing air pollution series:")

## The unit root tests on the stochastic trend of Beijing air pollution series:

ur_test(bj_trend)

## The p-value of Augmented Dickey???Fuller Test is 0.4499 
## The p-value of Phillips???Perron Unit Root Test is 0.6876 
## The p-value of KPSS Test for Stationarity is 0.01

So the stochastic trend is not stationary. Next I plot the ACF and PACF of this stochastic trend.

require(astsa)
x <- acf2(bj_trend, max.lag = 50)

plot of chunk acf2

You can judge from the graph that the stochastic trend of Beijing air pollution series seems like I(1). We then fit it to AR(1)

(bj_fit <- arima(bj_trend, order=c(1,0,0)))

## 
## Call:
## arima(x = bj_trend, order = c(1, 0, 0))
## 
## Coefficients:
##       ar1  intercept
##         1      96.74
## s.e.    0      14.79
## 
## sigma^2 estimated as 0.0415:  log likelihood = 727.5,  aic = -1449

As you can see, the coefficient (0.9999) is quite close to one. So the stochastic trend of Beijing air pollution series is random walk. So we can conclude that the stochastic trend is not predictable.

Question 5

Are there any city whose stochastic trend of air pollution is comoving with that of Beijing? (hint: use cointegration tests to find out)

First of all, I find the overlapping period of each city with Beijing. Then I get two time series with the common and continuous date. I decompose the two series to get two stochastic trends and I test they are cointegrated. I have to remove some cities for two reasons: it has observations less than two periods(years), or its overlaping period with Beijing is less than two periods(years).

# cointegration test function
cointest<-function(regdata){
    fitxy<-lm(y~x+0,data = regdata ,na.action = na.omit)
    pvalue <- (adf.test(fitxy$residuals))$p.value
    ifelse(pvalue<=0.05, 1, 0)
}

# function to get the continuous date observations
continu <- function(xx){
    xx[,1] <- as.Date(xx[,1],format="%m/%d/%Y")
    srt <- xx[order(xx$Date, decreasing = F),]
    n<- dim(srt)[1]-1
    for(i in 1:n){
        diff.day <- as.numeric(srt$Date[i+1]-srt$Date[i])
        if(diff.day>1){
            srt[i,] <- NA
        }else{
            next
        }
    }
    srt <- na.omit(srt)
    return(srt)
}

# function to generate a data frame with two time series
getts <- function(ctn){
    dst = ctn[1,1]
    year <- as.numeric(format(dst,"%Y"))
    ori<-paste(year,"-1-1",sep="")
    day <- as.POSIXlt(dst, format="%m/%d/%Y",origin=ori)$yday
    y.ts<-ts(ctn[,2], start=c(year,day), freq=fr)
    x.ts<-ts(ctn[,3], start=c(year,day), freq=fr)
    dts <- data.frame(y=y.ts, x=x.ts)
    return(dts)
}

# workhorse function
test <- function(dt){
    xx <- merge(bj, dt, by="Date")
    TF <- nrow(xx)>2*fr
    if(TF){
        ctn <- continu(xx)
        dts <- getts(ctn)
        TF <- !is.null(nrow(dts)) & nrow(dts) > 2*fr
        if(TF){
            x <- na.omit(decompose(dts$x)$trend)
            y <- na.omit(decompose(dts$y)$trend)
            inter <- na.omit(ts.union(y, x,dframe=TRUE))
            TF <- !is.null(nrow(inter)) & nrow(inter) > 20
            if(TF){
                regdata <- inter
            }else{
                regdata <- fake
            }
            TF <- !is.null(nrow(regdata)) & nrow(regdata) > 20
            result <- ifelse(TF, cointest(regdata), NA)
        }else{
            result <- NA
        }
    }else{
        result <- NA
    }
    result
}

Then I run the test. In the test outcomes, 1 is for Cointegration, 0 is for Not Cointegration, NA is for Not Available,

fr <- 365.25
df5 <- df[df["City"]!="Beijing",]
fake <- matrix(1:4,c(2,2))
# Parallel computing
cl <- makeCluster(detectCores()) 
clusterCall(cl, function() {
    library(tseries)
})

## [[1]]
## [1] "tseries"   "methods"   "stats"     "graphics"  "grDevices" "utils"    
## [7] "datasets"  "base"     
## 
## [[2]]
## [1] "tseries"   "methods"   "stats"     "graphics"  "grDevices" "utils"    
## [7] "datasets"  "base"

clusterExport(cl,c("fr","bj","continu","test","getts","cointest"))
registerDoParallel(cl)
test.res <- ddply(.data = df5,.variables = "City",.parallel = T, .fun = function(dt){
    test(dt)
})
stopCluster(cl)
colnames(test.res)<-c("City", "Coint")
head(test.res)

##      City Coint
## 1  Anshan     0
## 2  Anyang    NA
## 3 Baoding    NA
## 4   Baoji     0
## 5  Baotou    NA
## 6  Beihai     0

(resNA <- sum(is.na(test.res["Coint"])))

## [1] 34

test.res <- na.omit(test.res)
(res0 <- sum(test.res["Coint"]==0))

## [1] 85

(res1 <- sum(test.res["Coint"]==1))

## [1] 0

So we can see that 34 cities are excluded from the tests for not enough observations, 85 cities are not comoving with Beijing and 0 city is comoving with Beijing. So the answer to this question is NO.

WISE R Club project is proudly maintained by XiaojunSun.