前言

本方案基于Liu Xi的方案改进得到,较我提出的第一种方案也有非常大改进,统计结果更加准确,运算的速度也大幅提高。

数据

本文所使用数据的下载地址为:http://pan.baidu.com/s/1nt5eyhz ,密码:t3i6。数据格式和内容基本与Liu Xi的一致,只是添加了一些变量名。有兴趣的同学可以复制本文的结果。

步骤

加载安装包

library(lubridate)
library(tidyr)
library(plyr)
## 
## Attaching package: 'plyr'
## 
## The following object is masked from 'package:lubridate':
## 
##     here
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(pipeR)

导入数据

setwd("E:/Project/WISE R Club/seminars") # 请改为你自己的工作路径
dat0<-read.csv('original1.csv',na.strings = "", header = TRUE, stringsAsFactors = FALSE)
dat <- dat0[,colSums(is.na(dat0))<nrow(dat0)]
dat$id <- as.character(dat$id)
dat <- dat[-which(is.na(dat$name)),] # exclude NA rows
str(dat)
## 'data.frame':    2146 obs. of  11 variables:
##  $ date  : chr  "2011-2-13" "2011-2-24" "2011-3-3" "2011-3-22" ...
##  $ id    : chr  "103840100190232" "103840100190232" "103840100190232" "103840100190232" ...
##  $ number: int  153 153 153 153 153 153 153 153 153 153 ...
##  $ name  : chr  "沈雪双" "沈雪双" "沈雪双" "沈雪双" ...
##  $ grade : chr  "10硕士" "10硕士" "10硕士" "10硕士" ...
##  $ X01   : chr  "16:44" "16:28" "16:28" "19:09" ...
##  $ X02   : chr  NA "18:15" "17:55" "20:39" ...
##  $ X03   : chr  NA NA NA NA ...
##  $ X04   : chr  NA NA NA NA ...
##  $ X05   : chr  NA NA NA NA ...
##  $ X06   : chr  NA NA NA NA ...

生成时间数据

pars_t <- function(x){paste(dat[,"date"], x, sep=" ")%>>%
            parse_date_time("%Y%m%d %H%M%S", truncated = 2, quiet = TRUE)}

tcli<-grep("^X",names(dat))

tcl<-llply(dat[,tcli], pars_t)%>>%
    as.data.frame

数据变换

对每一行计算有效考勤次数。

dift <- tcl[,2:length(tcli)]-tcl[,1:(length(tcli)-1)]
logi <- dift > as.period(30, "minuts")

times <- adply(logi, 1, function(x){
    a <- which(x==TRUE)
    tms <- ifelse(length(a)==1,length(a),ceiling(sum(x, na.rm = TRUE)/2)) # 这是我的方法与Liu Xi的唯一的不同
    ifelse(is.na(tms), 0, tms)
})
str(times)
## 'data.frame':    2146 obs. of  2 variables:
##  $ X1: Factor w/ 2146 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ V1: num  0 1 1 1 2 0 1 1 0 1 ...

汇总结果

count <- times[,2]

results <-dat[,-tcli] %>>%
    cbind(count) %>>%
    ddply(.(id, name), function(df) sum(df$count))%>>%
    dplyr::rename(times=V1) %>>%
    arrange(desc(times))

统计完毕。

结果展示

所以我们可以看一看那些同学听讲座比较积极:

knitr::kable(head(results),format = "markdown", align="l")
id name times
27720090153641 李云森 20
27720091152398 储丽娅 20
27720091152423 张志远 20
27720091152435 张扬 20
103840105640253 翁明明 19
27720091152439 杨进 19

我们还可以看一看那些同学刷讲座的次数是在2次以下的,需要提醒他们赶紧多去听讲座刷指纹。

warm <- results %>>%
    filter(times < 2)
knitr::kable(head(warm),format = "markdown", align="l")
id name times
17520082200593 孙丽向 1
17620082200757 李蛟 1
17620082200839 张杨红 1
19820082203236 陈丹 1
19820082203296 曾泽英 1
19920082203609 马丽娜 1

与Liu Xi的结果对比

按照Xi Liu的方法计算有效次数。

times2 <- adply(logi, 1, function(x){
    a <- which(x==TRUE)
    tms <- ifelse(all(diff(a)>1)|length(a)==1,length(a),length(a)-ceiling(length(which(diff(a)==1))/2)) 
    ifelse(is.na(tms), 0, tms)
})

汇总结果。

count2 <- times2[,2]

results_xi <-dat[,-tcli] %>>%
    cbind(count2) %>>%
    ddply(.(id, name), function(df) sum(df$count))%>>%
    dplyr::rename(times=V1) %>>%
    arrange(desc(times))

进行比较。

cmpars <- merge(results, results_xi, by = c("name", "id"))
names(cmpars)[3:4] <- c("sun", "xi")
head(cmpars)
##               name             id sun xi
## 1             李蛟 17620082200757   1  1
## 2 Abiy Hailemariam 27720091154030   0  0
## 3  BENJAMIN NELSON 15220081153711   0  0
## 4    Benny Pandowo 27720091154050   3  3
## 5    Byung Lim Koo 27720081153767   0  0
## 6        Carl Zhan 27720101154458   0  0

两者之间结果不一致的情况次数。

sum(!((cmpars$sun-cmpars$xi)==0))
## [1] 0

Wow!!! 0!!!!

所以大多数情况下,两种方法等价。Wonderful!


结语

本文结果是在以下环境中产生的:

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: i386-w64-mingw32/i386 (32-bit)
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C                                                   
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] pipeR_0.5       dplyr_0.3.0.2   plyr_1.8.1      tidyr_0.1      
## [5] lubridate_1.3.3
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1   DBI_0.3.1        digest_0.6.4     evaluate_0.5.5  
##  [5] formatR_1.0      htmltools_0.2.6  knitr_1.8        lazyeval_0.1.9  
##  [9] magrittr_1.0.1   memoise_0.2.1    parallel_3.1.2   Rcpp_0.11.3     
## [13] rmarkdown_0.3.11 stringr_0.6.2    tools_3.1.2      yaml_2.1.13

欢迎针对本文的方法提出建设性的意见!

谢谢!