11 02 Lecture Note
require(rhdfs)
Loading required package: rhdfs
Loading required package: rJava

HADOOP_CMD=/home/stat/hadoop/hadoop-2.7.4/bin/hadoop

Be sure to run hdfs.init()
Warning message:
S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found
hdfs.init()
17/11/01 14:41:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
require(rmr2)
Loading required package: rmr2

K-means Clustering

Distance function definition (dist_fun)

C : Centers of Clusters

P : Points

dist_fun <- function(C, P){
  apply(C, 1, function(x){ colSums( (t(P) - x)^2 ) } )
}
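
To make the shape of the result concrete, here is a small sketch that runs dist_fun on made-up data (C_demo and P_demo are illustrative values, not from the lecture). It needs only base R. Rows of the result correspond to points, columns to centers, and the entries are squared Euclidean distances.

```r
dist_fun <- function(C, P){
  apply(C, 1, function(x){ colSums( (t(P) - x)^2 ) } )
}

# Two illustrative centers and three illustrative points in 2-D (made-up values).
C_demo <- rbind(c(0, 0), c(10, 10))
P_demo <- rbind(c(1, 1), c(9, 8), c(0, 2))

D_demo <- dist_fun(C_demo, P_demo)
print(D_demo)             # 3 x 2 matrix: one row per point, one column per center

# Index of the nearest center for each point: the column with the smallest distance.
nearest_demo <- max.col(-D_demo)
print(nearest_demo)
```

Negating D before max.col turns "largest value per row" into "smallest distance per row", which is how the mapper picks the nearest center.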

MAPPER (kmeans_map)

C : Centers of Clusters

k : keys

v : values (Points)

D : distances

nearest : The cluster which has the nearest center

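kmeans_map <- function(k, v){
  # C and num.clusters are free variables, looked up in the enclosing environment.
  nearest <- if (is.null(C)) {
    # No centers yet: assign each point to a random cluster.
    sample(1:num.clusters, nrow(v), replace = TRUE)
  } else {
    # D[i, j] is the distance from point i to center j;
    # max.col(-D) picks the column with the smallest distance in each row.
    D <- dist_fun(C, v)
    max.col(-D)
  }
  keyval(nearest, v)
}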
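
The mapper reads C and num.clusters as free variables, so the values it sees depend on where the function was defined, not on where it is called from (R uses lexical scoping). The base-R sketch below illustrates this with hypothetical helper names (centers_missing, run_demo); it does not touch Hadoop. A function defined at top level keeps seeing the top-level C even when its caller holds a different local C, unless its environment is rebound.

```r
C <- NULL                                  # top-level C, as in a fresh session

# Defined at top level: the free variable C resolves where the function was defined.
centers_missing <- function() is.null(C)

run_demo <- function() {
  C <- matrix(c(1, 2), nrow = 1)           # local C, invisible to centers_missing()
  before <- centers_missing()              # still sees the top-level NULL

  rebound <- centers_missing               # copy of the function ...
  environment(rebound) <- environment()    # ... that now resolves C in this call frame
  after <- rebound()

  c(before = before, after = after)
}

scope_demo <- run_demo()
print(scope_demo)
```

This matters for a driver that updates the centers in a local variable on each iteration: a mapper defined at top level does not see that local variable by default.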

REDUCER (kmeans_reduce)

k : keys (nearest)

v : values (Points)

kmeans_reduce <- function(k,v){
  keyval(k,t(as.matrix(apply(v,2,mean))))
}
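
The reducer receives all points that share one key and returns their column means as a single-row matrix, so that the rows from different keys can stack into a centers matrix. A quick local check on made-up values (v_demo is illustrative) shows the shape without calling keyval:

```r
# Illustrative points that were assigned to the same cluster (made-up values).
v_demo <- rbind(c(1, 10), c(3, 20), c(5, 60))
colnames(v_demo) <- c("x", "y")

# apply(v, 2, mean) gives a named vector; t(as.matrix(.)) turns it into a 1 x d matrix.
center_demo <- t(as.matrix(apply(v_demo, 2, mean)))
print(center_demo)
print(dim(center_demo))   # 1 row, 2 columns

# colMeans computes the same values in one call.
print(all.equal(as.vector(center_demo), unname(colMeans(v_demo))))
```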

MAPREDUCE (kmeans_mr)

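kmeans_mr <- function(P, num.clusters = 3, num.iters = 5){
  C <- NULL
  # kmeans_map is defined at top level, so by default it would look up C and
  # num.clusters in the global environment. Rebind a copy to this call frame
  # so that the mapper sees the centers updated in the loop below.
  map_fun <- kmeans_map
  environment(map_fun) <- environment()
  for (i in 1:num.iters){
    result <- from.dfs(
      mapreduce(input = P, map = map_fun, reduce = kmeans_reduce))
    C <- result$val
    # If some clusters received no points, fill in the missing centers with
    # random linear combinations of the existing ones.
    if (nrow(C) < num.clusters){
      C <- rbind(
        C,
        matrix(
          rnorm((num.clusters - nrow(C)) * nrow(C)),
          ncol = nrow(C)) %*% C)
    }
  }
  return(C)
}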
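
If a cluster ends up with no points, the reducer is never called for that key and result$val has fewer than num.clusters rows. The rbind step in kmeans_mr then adds the missing rows as random linear combinations of the surviving centers. The sketch below replays just that step in base R on made-up centers (C_demo, W, and C_filled are illustrative names; set.seed is only there to make the run repeatable).

```r
set.seed(1)                       # only for repeatability of this illustration
num.clusters <- 3

# Suppose only two of the three clusters received points (made-up centers, d = 4).
C_demo <- rbind(c(5.0, 3.4, 1.5, 0.2),
                c(6.3, 2.9, 4.9, 1.7))

n_missing <- num.clusters - nrow(C_demo)

# W has one row per missing center and one column per existing center.
W <- matrix(rnorm(n_missing * nrow(C_demo)), ncol = nrow(C_demo))

# Each new center is a random linear combination of the existing centers.
C_filled <- rbind(C_demo, W %*% C_demo)
print(dim(C_filled))              # num.clusters rows, same number of columns as before
```

Because the weights are drawn from rnorm and are not constrained to be positive or to sum to one, a new center can fall well outside the data.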

Run

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
X <- as.matrix(iris[,-5])
head(X)
     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]          5.1         3.5          1.4         0.2
[2,]          4.9         3.0          1.4         0.2
[3,]          4.7         3.2          1.3         0.2
[4,]          4.6         3.1          1.5         0.2
[5,]          5.0         3.6          1.4         0.2
[6,]          5.4         3.9          1.7         0.4
P            <- to.dfs(X)
17/11/01 14:41:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
17/11/01 14:41:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
num.clusters <- 3
num.iters    <- 10
centers_hadoop <- kmeans_mr(P = P,
                            num.clusters = num.clusters,
                            num.iters = num.iters)
centers_kmeans <- kmeans(X, centers = 3, iter.max = 10L)$centers
centers_hadoop
     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]     6.853846    3.076923     5.715385    2.053846
[2,]     5.883607    2.740984     4.388525    1.434426
[3,]     5.006000    3.428000     1.462000    0.246000
centers_kmeans
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     5.901613    2.748387     4.393548    1.433871
3     6.850000    3.073684     5.742105    2.071053
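
The two results list the clusters in a different order, so a row-by-row comparison needs the rows aligned first. A minimal sketch, using the values printed above and sorting both matrices by Sepal.Length (a convenient choice here because the three centers are well separated on that column; h, k, and max_abs_diff are illustrative names):

```r
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Values copied from the printed output above.
centers_hadoop <- matrix(c(6.853846, 3.076923, 5.715385, 2.053846,
                           5.883607, 2.740984, 4.388525, 1.434426,
                           5.006000, 3.428000, 1.462000, 0.246000),
                         ncol = 4, byrow = TRUE, dimnames = list(NULL, cols))
centers_kmeans <- matrix(c(5.006000, 3.428000, 1.462000, 0.246000,
                           5.901613, 2.748387, 4.393548, 1.433871,
                           6.850000, 3.073684, 5.742105, 2.071053),
                         ncol = 4, byrow = TRUE, dimnames = list(NULL, cols))

# Align rows by sorting on the first column, then compare entry by entry.
h <- centers_hadoop[order(centers_hadoop[, 1]), ]
k <- centers_kmeans[order(centers_kmeans[, 1]), ]
max_abs_diff <- max(abs(h - k))
print(round(h - k, 4))
print(max_abs_diff)
```

On these printed values the setosa center is identical in both results, and the other two centers agree to within 0.03 on every coordinate.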

Exercise

Can you write k-medians code based on this?

11 09 Solution
require(rhdfs)
Loading required package: rhdfs
Loading required package: rJava

HADOOP_CMD=/home/stat/hadoop/hadoop-2.7.4/bin/hadoop

Be sure to run hdfs.init()
hdfs.init()
17/11/05 12:51:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
require(rmr2)
Loading required package: rmr2
S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found
Please review your hadoop settings. See help(hadoop.settings)

Exercise

Can you write k-medians code based on this?

K-medians Clustering

Distance function definition (dist_fun)

C : Centers of Clusters

P : Points

dist_fun <- function(C, P){
  apply(C, 1, function(x){ colSums( abs(t(P) - x) ) } )
}
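
The only change from the k-means version is that squared differences are replaced by absolute differences, which gives the Manhattan (L1) distance. The two distances can disagree about which center is nearest. The sketch below shows this on made-up values (dist_l2, dist_l1, C_demo, and P_demo are illustrative names; dist_l1 has the same body as the dist_fun above).

```r
# Squared Euclidean distance (k-means version) and Manhattan distance (k-medians version).
dist_l2 <- function(C, P){
  apply(C, 1, function(x){ colSums( (t(P) - x)^2 ) } )
}
dist_l1 <- function(C, P){
  apply(C, 1, function(x){ colSums( abs(t(P) - x) ) } )
}

# Made-up data: two centers and two points.
C_demo <- rbind(c(3, 3), c(0, 5))
P_demo <- rbind(c(0, 0), c(3, 4))

D2 <- dist_l2(C_demo, P_demo)
D1 <- dist_l1(C_demo, P_demo)
print(D2)
print(D1)

# Nearest center per point under each distance.
nearest_l2 <- max.col(-D2)
nearest_l1 <- max.col(-D1)
print(rbind(l2 = nearest_l2, l1 = nearest_l1))
```

For the first point, the squared Euclidean distance prefers the first center (18 against 25) while the Manhattan distance prefers the second (5 against 6).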

MAPPER (kmeans_map)

C : Centers of Clusters

k : keys

v : values (Points)

D : distances

nearest : The cluster which has the nearest center

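kmeans_map <- function(k, v){
  # C and num.clusters are free variables, looked up in the enclosing environment.
  nearest <- if (is.null(C)) {
    # No centers yet: assign each point to a random cluster.
    sample(1:num.clusters, nrow(v), replace = TRUE)
  } else {
    # D[i, j] is the distance from point i to center j;
    # max.col(-D) picks the column with the smallest distance in each row.
    D <- dist_fun(C, v)
    max.col(-D)
  }
  keyval(nearest, v)
}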

REDUCER (kmeans_reduce)

k : keys (nearest)

v : values (Points)

kmeans_reduce <- function(k,v){
  keyval(k,t(as.matrix(apply(v,2,median))))
}
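
Replacing mean with median in the reducer makes each center the coordinate-wise median of its cluster. The coordinate-wise median minimizes the summed Manhattan distance to the points, and it is less sensitive to outliers than the mean. A small made-up example (v_demo and l1_cost are illustrative names):

```r
# Made-up cluster with one outlying point in the last row.
v_demo <- rbind(c(1.0, 2.0), c(1.2, 2.1), c(0.9, 1.9), c(1.1, 2.0), c(9.0, 9.0))

center_mean   <- t(as.matrix(apply(v_demo, 2, mean)))
center_median <- t(as.matrix(apply(v_demo, 2, median)))
print(center_mean)     # pulled toward the outlier
print(center_median)   # stays with the bulk of the points

# Summed Manhattan distance from all points to a candidate center.
l1_cost <- function(center, v) sum(abs(t(v) - as.vector(center)))
cost_mean   <- l1_cost(center_mean, v_demo)
cost_median <- l1_cost(center_median, v_demo)
print(c(mean = cost_mean, median = cost_median))
```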

MAPREDUCE (kmeans_mr)

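kmeans_mr <- function(P, num.clusters = 3, num.iters = 5){
  C <- NULL
  # kmeans_map is defined at top level, so by default it would look up C and
  # num.clusters in the global environment. Rebind a copy to this call frame
  # so that the mapper sees the centers updated in the loop below.
  map_fun <- kmeans_map
  environment(map_fun) <- environment()
  for (i in 1:num.iters){
    result <- from.dfs(
      mapreduce(input = P, map = map_fun, reduce = kmeans_reduce))
    C <- result$val
    # If some clusters received no points, fill in the missing centers with
    # random linear combinations of the existing ones.
    if (nrow(C) < num.clusters){
      C <- rbind(
        C,
        matrix(
          rnorm((num.clusters - nrow(C)) * nrow(C)),
          ncol = nrow(C)) %*% C)
    }
  }
  return(C)
}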

Run

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
X <- as.matrix(iris[,-5])
head(X)
     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]          5.1         3.5          1.4         0.2
[2,]          4.9         3.0          1.4         0.2
[3,]          4.7         3.2          1.3         0.2
[4,]          4.6         3.1          1.5         0.2
[5,]          5.0         3.6          1.4         0.2
[6,]          5.4         3.9          1.7         0.4
C            <- NULL
P            <- to.dfs(X)
17/11/05 12:51:57 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
17/11/05 12:51:57 INFO compress.CodecPool: Got brand-new compressor [.deflate]
num.clusters <- 3
num.iters    <- 10
centers_hadoop <- kmeans_mr(P = P,
                            num.clusters = num.clusters,
                            num.iters = num.iters)
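centers_hadoop
     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]          6.0         3.0          4.5         1.4
[2,]          5.7         3.0          4.0         1.3
[3,]          5.7         3.1          4.2         1.3

Note: the three centers in this recorded output are nearly identical and sit close to the overall column medians of iris (5.8, 3.0, 4.35, 1.3). That is the pattern expected when points are assigned at random in every iteration, which is what happens if kmeans_map keeps seeing the global C (set to NULL above) rather than the C updated inside kmeans_mr. A run in which the mapper sees the updated centers would be expected to isolate the setosa cluster, as the k-means centers in the lecture note do.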
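
One way to check the logic without a cluster is to replay the same map and reduce steps in plain R. The sketch below is a local stand-in, not the rmr2 job: dist_l1, C_local, keys, and assignment are illustrative names, empty clusters are simply dropped rather than re-seeded, and the seed is arbitrary. Results depend on the random start, so no output is shown.

```r
set.seed(42)                      # arbitrary seed, only for repeatability
X <- as.matrix(iris[, -5])
num.clusters <- 3
num.iters    <- 10

# Manhattan distance from every point (rows of P) to every center (rows of C).
dist_l1 <- function(C, P){
  apply(C, 1, function(x){ colSums( abs(t(P) - x) ) } )
}

# Start from a random assignment, as the mapper does when no centers exist yet.
assignment <- sample(1:num.clusters, nrow(X), replace = TRUE)

for (i in 1:num.iters) {
  # "Reduce": coordinate-wise median of each non-empty cluster.
  keys <- sort(unique(assignment))
  C_local <- t(sapply(keys, function(key) {
    apply(X[assignment == key, , drop = FALSE], 2, median)
  }))
  # "Map": reassign each point to its nearest center (columns of D follow keys).
  assignment <- keys[max.col(-dist_l1(C_local, X))]
}

print(C_local)
print(table(assignment, iris$Species))
```

The cross-table against Species is a quick way to see whether the clusters separate the flower types or collapse onto similar centers.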