本方案基于Liu Xi的方案改进得到,较我提出的第一种方案也有非常大改进,统计结果更加准确,运算的速度也大幅提高。
本文所使用数据的下载地址为:http://pan.baidu.com/s/1nt5eyhz ,密码:t3i6。数据格式和内容基本与Liu Xi的一致,只是添加了一些变量名。有兴趣的同学可以复制本文的结果。
library(lubridate)
library(tidyr)
library(plyr)
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:lubridate':
##
## here
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(pipeR)
setwd("E:/Project/WISE R Club/seminars") # 请改为你自己的工作路径
dat0<-read.csv('original1.csv',na.strings = "", header = TRUE, stringsAsFactors = FALSE)
dat <- dat0[,colSums(is.na(dat0))<nrow(dat0)]
dat$id <- as.character(dat$id)
dat <- dat[-which(is.na(dat$name)),] # exclude NA rows
str(dat)
## 'data.frame': 2146 obs. of 11 variables:
## $ date : chr "2011-2-13" "2011-2-24" "2011-3-3" "2011-3-22" ...
## $ id : chr "103840100190232" "103840100190232" "103840100190232" "103840100190232" ...
## $ number: int 153 153 153 153 153 153 153 153 153 153 ...
## $ name : chr "沈雪双" "沈雪双" "沈雪双" "沈雪双" ...
## $ grade : chr "10硕士" "10硕士" "10硕士" "10硕士" ...
## $ X01 : chr "16:44" "16:28" "16:28" "19:09" ...
## $ X02 : chr NA "18:15" "17:55" "20:39" ...
## $ X03 : chr NA NA NA NA ...
## $ X04 : chr NA NA NA NA ...
## $ X05 : chr NA NA NA NA ...
## $ X06 : chr NA NA NA NA ...
pars_t <- function(x){paste(dat[,"date"], x, sep=" ")%>>%
parse_date_time("%Y%m%d %H%M%S", truncated = 2, quiet = TRUE)}
tcli<-grep("^X",names(dat))
tcl<-llply(dat[,tcli], pars_t)%>>%
as.data.frame
对每一行计算有效考勤次数。
dift <- tcl[,2:length(tcli)]-tcl[,1:(length(tcli)-1)]
logi <- dift > as.period(30, "minuts")
times <- adply(logi, 1, function(x){
a <- which(x==TRUE)
tms <- ifelse(length(a)==1,length(a),ceiling(sum(x, na.rm = TRUE)/2)) # 这是我的方法与Liu Xi的唯一的不同
ifelse(is.na(tms), 0, tms)
})
str(times)
## 'data.frame': 2146 obs. of 2 variables:
## $ X1: Factor w/ 2146 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ V1: num 0 1 1 1 2 0 1 1 0 1 ...
count <- times[,2]
results <-dat[,-tcli] %>>%
cbind(count) %>>%
ddply(.(id, name), function(df) sum(df$count))%>>%
dplyr::rename(times=V1) %>>%
arrange(desc(times))
统计完毕。
所以我们可以看一看那些同学听讲座比较积极:
knitr::kable(head(results),format = "markdown", align="l")
id | name | times |
---|---|---|
27720090153641 | 李云森 | 20 |
27720091152398 | 储丽娅 | 20 |
27720091152423 | 张志远 | 20 |
27720091152435 | 张扬 | 20 |
103840105640253 | 翁明明 | 19 |
27720091152439 | 杨进 | 19 |
我们还可以看一看那些同学刷讲座的次数是在2次以下的,需要提醒他们赶紧多去听讲座刷指纹。
warm <- results %>>%
filter(times < 2)
knitr::kable(head(warm),format = "markdown", align="l")
id | name | times |
---|---|---|
17520082200593 | 孙丽向 | 1 |
17620082200757 | 李蛟 | 1 |
17620082200839 | 张杨红 | 1 |
19820082203236 | 陈丹 | 1 |
19820082203296 | 曾泽英 | 1 |
19920082203609 | 马丽娜 | 1 |
按照Xi Liu的方法计算有效次数。
times2 <- adply(logi, 1, function(x){
a <- which(x==TRUE)
tms <- ifelse(all(diff(a)>1)|length(a)==1,length(a),length(a)-ceiling(length(which(diff(a)==1))/2))
ifelse(is.na(tms), 0, tms)
})
汇总结果。
count2 <- times2[,2]
results_xi <-dat[,-tcli] %>>%
cbind(count2) %>>%
ddply(.(id, name), function(df) sum(df$count))%>>%
dplyr::rename(times=V1) %>>%
arrange(desc(times))
进行比较。
cmpars <- merge(results, results_xi, by = c("name", "id"))
names(cmpars)[3:4] <- c("sun", "xi")
head(cmpars)
## name id sun xi
## 1 李蛟 17620082200757 1 1
## 2 Abiy Hailemariam 27720091154030 0 0
## 3 BENJAMIN NELSON 15220081153711 0 0
## 4 Benny Pandowo 27720091154050 3 3
## 5 Byung Lim Koo 27720081153767 0 0
## 6 Carl Zhan 27720101154458 0 0
两者之间结果不一致的情况次数。
sum(!((cmpars$sun-cmpars$xi)==0))
## [1] 0
Wow!!! 0!!!!
所以大多数情况下,两种方法等价。Wonderful!
本文结果是在以下环境中产生的:
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: i386-w64-mingw32/i386 (32-bit)
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pipeR_0.5 dplyr_0.3.0.2 plyr_1.8.1 tidyr_0.1
## [5] lubridate_1.3.3
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 DBI_0.3.1 digest_0.6.4 evaluate_0.5.5
## [5] formatR_1.0 htmltools_0.2.6 knitr_1.8 lazyeval_0.1.9
## [9] magrittr_1.0.1 memoise_0.2.1 parallel_3.1.2 Rcpp_0.11.3
## [13] rmarkdown_0.3.11 stringr_0.6.2 tools_3.1.2 yaml_2.1.13
欢迎针对本文的方法提出建设性的意见!
谢谢!