11 02 Lecture Note
require(rhdfs)
Loading required package: rhdfs
Loading required package: rJava
HADOOP_CMD=/home/stat/hadoop/hadoop-2.7.4/bin/hadoop
Be sure to run hdfs.init()
Warning message(s):
S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found
hdfs.init()
17/11/01 14:41:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
require(rmr2)
Loading required package: rmr2
K-means Clustering
Defining the distance function (dist_fun)
C : Centers of Clusters
P : Points
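# Squared Euclidean distance from every point (row of P) to every center (row of C).
# Returns an nrow(P) x nrow(C) matrix: one row per point, one column per center.
dist_fun <- function(C, P){
  apply(C, 1, function(x){ colSums( (t(P) - x)^2 ) } )
}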
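To make the shape of the result concrete, here is a small check of dist_fun on toy data. The matrices below are made up for illustration and are not part of the lecture.

```r
# Same definition as in the note above.
dist_fun <- function(C, P){
  apply(C, 1, function(x){ colSums( (t(P) - x)^2 ) } )
}

# Hypothetical toy data: 3 points and 2 centers in 2 dimensions.
P <- rbind(c(0, 0), c(1, 1), c(4, 4))
C <- rbind(c(0, 0), c(4, 4))

D <- dist_fun(C, P)
dim(D)   # one row per point, one column per center
D        # D[i, j] is the squared distance from point i to center j
```

Note that t(P) - x relies on R recycling x down each column of t(P), which is why the points are transposed first.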
MAPPER (kmeans_map)
C : Centers of Clusters
k : keys
v : values (Points)
D : distances
nearest : The cluster which has the nearest center
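# C and num.clusters are not arguments: they are looked up when the mapper runs,
# so both must be visible to it.
kmeans_map <- function(k, v){
  nearest <- if (is.null(C)) {
    # First iteration: no centers yet, so assign each point to a random cluster.
    sample(1:num.clusters, nrow(v), replace = TRUE)
  } else {
    # Later iterations: pick the column (center) with the smallest distance.
    D <- dist_fun(C, v)
    max.col(-D)
  }
  keyval(nearest, v)
}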
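The max.col(-D) idiom in the mapper can be checked on its own. The sketch below uses a made-up distance matrix; keep in mind that max.col breaks ties at random by default, whereas which.min takes the first minimum.

```r
# Hypothetical distance matrix: rows are points, columns are centers.
D <- rbind(c(0.5, 3.0, 9.0),
           c(7.0, 0.2, 4.0),
           c(6.0, 5.0, 1.0))

# max.col() returns, for each row, the index of the largest entry,
# so negating D gives the index of the smallest distance.
nearest <- max.col(-D)
nearest

# Equivalent formulation with which.min (ties go to the first minimum).
nearest_alt <- apply(D, 1, which.min)
nearest_alt
```

With no ties in D, both versions return 1, 2, 3.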
REDUCER (kmeans_reduce)
k : keys (nearest)
v : values (Points)
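# Each call receives one cluster key k and the matrix v of points assigned to it;
# the new center is the column-wise mean, returned as a 1 x ncol(v) matrix.
kmeans_reduce <- function(k, v){
  keyval(k, t(as.matrix(apply(v, 2, mean))))
}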
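What the reducer computes per key can be imitated in plain R without Hadoop. This is only a sketch on made-up data; split() stands in for the shuffle step that groups values by key.

```r
# Hypothetical toy input: 4 points in 2 dimensions and their cluster keys.
v   <- rbind(c(1, 2), c(3, 4), c(10, 10), c(12, 14))
key <- c(1, 1, 2, 2)

# Group the row indices by key, then take column-wise means per group,
# which mirrors what the reducer does for each key.
groups  <- split(seq_len(nrow(v)), key)
centers <- t(sapply(groups, function(idx) colMeans(v[idx, , drop = FALSE])))
centers   # one row per key
```

Here key 1 yields the center (2, 3) and key 2 yields (11, 12).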
MAPREDUCE (kmeans_mr)
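kmeans_mr <- function(P, num.clusters = 3, num.iters = 5){
  C <- NULL   # no centers before the first iteration
  for (i in seq_len(num.iters)) {
    # One MapReduce pass: assign points to clusters, then recompute the centers.
    result <- from.dfs(
      mapreduce(
        input = P, map = kmeans_map, reduce = kmeans_reduce))
    C <- result$val
    # If a cluster ended up empty, pad C with random linear combinations
    # of the surviving centers so that num.clusters rows remain.
    if (nrow(C) < num.clusters) {
      C <- rbind(
        C,
        matrix(
          rnorm((num.clusters - nrow(C)) * nrow(C)),
          ncol = nrow(C)) %*% C)
    }
  }
  return(C)
}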
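For debugging, the same loop can be imitated in plain R on an in-memory matrix. The function kmeans_local below is a hypothetical stand-in, not part of rmr2 or the lecture: it keeps C as an ordinary local variable, so there is no ambiguity about which C the assignment step sees, and it leaves out the padding step for empty clusters.

```r
# Hypothetical local stand-in for kmeans_mr: same map and reduce logic,
# but run in plain R on a matrix X instead of an HDFS object.
kmeans_local <- function(X, num.clusters = 3, num.iters = 5){
  C <- NULL
  for (i in seq_len(num.iters)) {
    # "Map" step: random keys on the first pass, nearest center afterwards.
    nearest <- if (is.null(C)) {
      sample(1:num.clusters, nrow(X), replace = TRUE)
    } else {
      D <- apply(C, 1, function(x) colSums((t(X) - x)^2))
      max.col(-D)
    }
    # "Reduce" step: one column-wise mean per key that actually occurred.
    groups <- split(seq_len(nrow(X)), nearest)
    C <- t(sapply(groups, function(idx) colMeans(X[idx, , drop = FALSE])))
  }
  C
}

set.seed(1)  # arbitrary seed, only to make the random start repeatable
X <- as.matrix(iris[, -5])
centers_local <- kmeans_local(X, num.clusters = 3, num.iters = 10)
dim(centers_local)
centers_local
```

Because the start is random, the centers and their row order depend on the seed.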
Run
head(iris)
| | Sepal.Length <dbl> | Sepal.Width <dbl> | Petal.Length <dbl> | Petal.Width <dbl> | Species <fctr> |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
X <- as.matrix(iris[,-5])
head(X)
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] 5.1 3.5 1.4 0.2
[2,] 4.9 3.0 1.4 0.2
[3,] 4.7 3.2 1.3 0.2
[4,] 4.6 3.1 1.5 0.2
[5,] 5.0 3.6 1.4 0.2
[6,] 5.4 3.9 1.7 0.4
P <- to.dfs(X)
17/11/01 14:41:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
17/11/01 14:41:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
num.clusters <- 3
num.iters <- 10
centers_hadoop <- kmeans_mr(P = P,
                            num.clusters = num.clusters,
                            num.iters = num.iters)
centers_kmeans <- kmeans(X, centers = 3, iter.max = 10L)$centers
centers_hadoop
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] 6.853846 3.076923 5.715385 2.053846
[2,] 5.883607 2.740984 4.388525 1.434426
[3,] 5.006000 3.428000 1.462000 0.246000
centers_kmeans
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.006000 3.428000 1.462000 0.246000
2 5.901613 2.748387 4.393548 1.433871
3 6.850000 3.073684 5.742105 2.071053
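The two printouts list the clusters in a different order, so a row-by-row comparison needs a matching step first. A minimal sketch, with the centers typed in from the output above:

```r
# Centers copied from the two printouts above.
centers_hadoop <- rbind(
  c(6.853846, 3.076923, 5.715385, 2.053846),
  c(5.883607, 2.740984, 4.388525, 1.434426),
  c(5.006000, 3.428000, 1.462000, 0.246000))
centers_kmeans <- rbind(
  c(5.006000, 3.428000, 1.462000, 0.246000),
  c(5.901613, 2.748387, 4.393548, 1.433871),
  c(6.850000, 3.073684, 5.742105, 2.071053))

# For each Hadoop center (row of D), find the closest kmeans() center (column of D).
D <- apply(centers_kmeans, 1, function(x) colSums((t(centers_hadoop) - x)^2))
match_idx <- max.col(-D)
match_idx

# Largest coordinate-wise gap after reordering.
max_gap <- max(abs(centers_hadoop - centers_kmeans[match_idx, ]))
max_gap
```

The Hadoop rows 1, 2, 3 pair with kmeans() rows 3, 2, 1, and the largest coordinate-wise gap is about 0.027, so the two runs agree up to cluster labels.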
Exercise
Can you write k-medians code based on this?
11 09 Solution
require(rhdfs)
Loading required package: rhdfs
Loading required package: rJava
HADOOP_CMD=/home/stat/hadoop/hadoop-2.7.4/bin/hadoop
Be sure to run hdfs.init()
hdfs.init()
17/11/05 12:51:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
require(rmr2)
Loading required package: rmr2
S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found
Please review your hadoop settings. See help(hadoop.settings)
Exercise
Can you write k-medians code based on this?
K-medians Clustering
Defining the distance function (dist_fun)
C : Centers of Clusters
P : Points
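# Manhattan (L1) distance from every point (row of P) to every center (row of C).
# Returns an nrow(P) x nrow(C) matrix: one row per point, one column per center.
dist_fun <- function(C, P){
  apply(C, 1, function(x){ colSums( abs(t(P) - x) ) } )
}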
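Switching the distance is more than cosmetic: the two metrics can disagree about which center is nearest. The sketch below uses made-up points and centers, and the names dist_l1 and dist_l2sq are chosen here for illustration.

```r
# L1 (k-medians) and squared L2 (k-means) versions of the distance function.
dist_l1   <- function(C, P) apply(C, 1, function(x) colSums(abs(t(P) - x)))
dist_l2sq <- function(C, P) apply(C, 1, function(x) colSums((t(P) - x)^2))

# Hypothetical toy data: 2 points and 2 centers in 2 dimensions.
P <- rbind(c(0, 0), c(2, 3))
C <- rbind(c(3, 0), c(2, 2))

D1 <- dist_l1(C, P)
D2 <- dist_l2sq(C, P)

nearest_l1 <- max.col(-D1)   # nearest center under the L1 distance
nearest_l2 <- max.col(-D2)   # nearest center under the squared L2 distance
nearest_l1
nearest_l2
```

The first point is 3 away from center 1 and 4 away from center 2 under L1, but 9 versus 8 under squared L2, so it switches clusters when the metric changes.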
MAPPER (kmeans_map)
C : Centers of Clusters
k : keys
v : values (Points)
D : distances
nearest : The cluster which has the nearest center
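# C and num.clusters are not arguments: they are looked up when the mapper runs,
# so both must be visible to it.
kmeans_map <- function(k, v){
  nearest <- if (is.null(C)) {
    # First iteration: no centers yet, so assign each point to a random cluster.
    sample(1:num.clusters, nrow(v), replace = TRUE)
  } else {
    # Later iterations: pick the column (center) with the smallest distance.
    D <- dist_fun(C, v)
    max.col(-D)
  }
  keyval(nearest, v)
}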
REDUCER (kmeans_reduce)
k : keys (nearest)
v : values (Points)
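# Each call receives one cluster key k and the matrix v of points assigned to it;
# the new center is the column-wise median, returned as a 1 x ncol(v) matrix.
kmeans_reduce <- function(k, v){
  keyval(k, t(as.matrix(apply(v, 2, median))))
}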
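The reducer pairs the median with the L1 distance because the median minimizes the sum of absolute deviations, just as the mean minimizes the sum of squared deviations. A quick numeric check on made-up values:

```r
x <- c(1, 2, 3, 4, 100)   # hypothetical values with one outlier

l1_cost <- function(m) sum(abs(x - m))   # objective behind k-medians
l2_cost <- function(m) sum((x - m)^2)    # objective behind k-means

# The median gives the smaller L1 cost, the mean the smaller L2 cost.
c(median = l1_cost(median(x)), mean = l1_cost(mean(x)))
c(median = l2_cost(median(x)), mean = l2_cost(mean(x)))
```

Here the median is 3 and the mean is 22: the L1 cost is 101 at the median versus 156 at the mean, while the squared cost is 9415 at the median versus 7610 at the mean.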
MAPREDUCE (kmeans_mr)
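kmeans_mr <- function(P, num.clusters = 3, num.iters = 5){
  C <- NULL   # no centers before the first iteration
  for (i in seq_len(num.iters)) {
    # One MapReduce pass: assign points to clusters, then recompute the centers.
    result <- from.dfs(
      mapreduce(
        input = P, map = kmeans_map, reduce = kmeans_reduce))
    C <- result$val
    # If a cluster ended up empty, pad C with random linear combinations
    # of the surviving centers so that num.clusters rows remain.
    if (nrow(C) < num.clusters) {
      C <- rbind(
        C,
        matrix(
          rnorm((num.clusters - nrow(C)) * nrow(C)),
          ncol = nrow(C)) %*% C)
    }
  }
  return(C)
}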
Run
head(iris)
| | Sepal.Length <dbl> | Sepal.Width <dbl> | Petal.Length <dbl> | Petal.Width <dbl> | Species <fctr> |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
X <- as.matrix(iris[,-5])
head(X)
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] 5.1 3.5 1.4 0.2
[2,] 4.9 3.0 1.4 0.2
[3,] 4.7 3.2 1.3 0.2
[4,] 4.6 3.1 1.5 0.2
[5,] 5.0 3.6 1.4 0.2
[6,] 5.4 3.9 1.7 0.4
C <- NULL
P <- to.dfs(X)
17/11/05 12:51:57 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
17/11/05 12:51:57 INFO compress.CodecPool: Got brand-new compressor [.deflate]
num.clusters <- 3
num.iters <- 10
centers_hadoop <- kmeans_mr(P = P,
                            num.clusters = num.clusters,
                            num.iters = num.iters)
centers_hadoop
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] 6.0 3.0 4.5 1.4
[2,] 5.7 3.0 4.0 1.3
[3,] 5.7 3.1 4.2 1.3
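The three centers above are nearly identical, which is worth a sanity check. The sketch below compares them with the overall column medians of iris and then runs a plain-R k-medians for reference. One possible reading is that the points were still being assigned close to randomly, but the sketch does not establish the cause. kmedians_local is a hypothetical helper, not part of rmr2 or the lecture.

```r
X <- as.matrix(iris[, -5])

# Overall column medians of the whole data set.
overall <- apply(X, 2, median)
overall

# Centers copied from the printout above.
centers_hadoop <- rbind(
  c(6.0, 3.0, 4.5, 1.4),
  c(5.7, 3.0, 4.0, 1.3),
  c(5.7, 3.1, 4.2, 1.3))

# Largest L1 distance from any printed center to the overall medians.
max_l1 <- max(rowSums(abs(sweep(centers_hadoop, 2, overall))))
max_l1

# Hypothetical local k-medians with the same map and reduce logic,
# keeping C as a local variable and skipping the empty-cluster padding.
kmedians_local <- function(X, num.clusters = 3, num.iters = 10){
  C <- NULL
  for (i in seq_len(num.iters)) {
    nearest <- if (is.null(C)) {
      sample(1:num.clusters, nrow(X), replace = TRUE)
    } else {
      max.col(-apply(C, 1, function(x) colSums(abs(t(X) - x))))
    }
    groups <- split(seq_len(nrow(X)), nearest)
    C <- t(sapply(groups, function(idx) apply(X[idx, , drop = FALSE], 2, median)))
  }
  C
}

set.seed(1)  # arbitrary seed, only to make the random start repeatable
centers_local <- kmedians_local(X)
centers_local
```

The overall medians are 5.8, 3.0, 4.35 and 1.3, and every printed center lies within an L1 distance of 0.45 of them. If the local run separates the clusters much more clearly than the Hadoop run did, the way the mapper obtains C is the first place to look.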