
Clustering using Autoencoder (ver.Python)

Iris

In [1]:
import pandas as pd
In [2]:
iris = pd.read_csv("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv")
In [3]:
iris.head()
Out[3]:
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
In [4]:
iris.shape
Out[4]:
(150, 5)
In [5]:
iris.describe()
Out[5]:
sepal.length sepal.width petal.length petal.width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
In [6]:
iris.variety.value_counts()
Out[6]:
Versicolor    50
Setosa        50
Virginica     50
Name: variety, dtype: int64

Manipulating the data set

In [7]:
import numpy as np
In [8]:
np.random.seed(1)
test_obs = np.random.choice(150,50,replace=False)
In [9]:
training_set = iris.loc[sorted(set(iris.index) - set(test_obs))]
testing_set = iris.loc[test_obs]
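
As a quick sanity check (my addition, not in the original notebook), the two subsets should partition the 150 rows exactly:

assert len(training_set) + len(testing_set) == len(iris)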
In [10]:
training_y = training_set.variety
training_x = training_set.drop("variety",axis=1)
In [11]:
testing_y = testing_set.variety
testing_x = testing_set.drop("variety",axis=1)
In [12]:
training_y.value_counts()
Out[12]:
Virginica     36
Setosa        33
Versicolor    31
Name: variety, dtype: int64
In [13]:
testing_y.value_counts()
Out[13]:
Versicolor    19
Setosa        17
Virginica     14
Name: variety, dtype: int64
In [14]:
std_vec = training_x.std()
mean_vec = training_x.mean()
In [15]:
std_vec
Out[15]:
sepal.length    0.832159
sepal.width     0.414888
petal.length    1.762407
petal.width     0.777899
dtype: float64
In [16]:
mean_vec
Out[16]:
sepal.length    5.806
sepal.width     3.033
petal.length    3.772
petal.width     1.205
dtype: float64
In [17]:
training_x = (training_x - mean_vec) / std_vec
testing_x = (testing_x - mean_vec) / std_vec
In [18]:
training_x.shape
Out[18]:
(100, 4)
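
Both sets are standardized with the training mean and standard deviation, so no information from the test set leaks into the scaling. As a quick check (my addition), the standardized training features should now have mean ≈ 0 and standard deviation ≈ 1:

print(training_x.mean().round(3))
print(training_x.std().round(3))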

A detailed explanation of the model is omitted here.

Autoencoder, elu, softmax, Adam

  • encoding : [inner product -> elu] -> [inner product -> elu] -> [inner product -> elu] -> [inner product -> elu] -> [inner product -> softmax]

  • decoding : [inner product -> elu] -> [inner product -> elu] -> [inner product -> elu] -> [inner product -> elu] -> [inner product]

  • Loss : squared error loss (written out below this list)

  • Optimizer : Adam
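
For reference, the squared error loss can be written out explicitly. With $x_i$ the $i$-th standardized input row and $\hat{x}_i$ the decoder's reconstruction of it, the objective minimized below is

$$\mathrm{SSE} \;=\; \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2 \;=\; \sum_{i=1}^{n} \sum_{j=1}^{4} \left( x_{ij} - \hat{x}_{ij} \right)^2,$$

which is exactly what tf.reduce_sum(tf.square(tf.subtract(de_layer5, x))) computes in the code further down.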

In [19]:
import tensorflow as tf
In [20]:
x = tf.placeholder("float", [None, 4])

Function definition: weight_variable - draws random values from a truncated normal distribution to create a weight tensor of the desired shape.

Function definition: bias_variable - creates a bias tensor of the desired shape (also initialized from a truncated normal here).

In [21]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.truncated_normal(shape)
    return tf.Variable(initial)
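
One thing to note (my observation, not part of the original notebook): tf.truncated_normal defaults to stddev=1.0, which is a fairly large initialization scale for layers this small. A minimal sketch of the common smaller-scale alternative:

def weight_variable_small(shape):
    # Same truncated normal init, but with stddev=0.1 instead of the default 1.0.
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)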

Model setup

Encoder

1 [inner product -> elu]

In [22]:
W_en1 = weight_variable([4,6])
b_en1 = bias_variable([6])
en_layer1 = tf.nn.elu(tf.matmul(x, W_en1) + b_en1)

2 [inner product -> elu]

In [23]:
W_en2 = weight_variable([6,6])
b_en2 = bias_variable([6])
en_layer2 = tf.nn.elu(tf.matmul(en_layer1, W_en2) + b_en2)

3 [inner product -> elu]

In [24]:
W_en3 = weight_variable([6,5])
b_en3 = bias_variable([5])
en_layer3 = tf.nn.elu(tf.matmul(en_layer2, W_en3) + b_en3)

4 [inner product -> elu]

In [25]:
W_en4 = weight_variable([5,4])
b_en4 = bias_variable([4])
en_layer4 = tf.nn.elu(tf.matmul(en_layer3, W_en4) + b_en4)

5 [inner product -> softmax]

In [26]:
W_en5 = weight_variable([4,3])
b_en5 = bias_variable([3])
en_layer5 = tf.nn.softmax(tf.matmul(en_layer4, W_en5) + b_en5)

Decoder

1 [inner product -> elu]

In [27]:
W_de1 = weight_variable([3,4])
b_de1 = bias_variable([4])
de_layer1 = tf.nn.elu(tf.matmul(en_layer5, W_de1) + b_de1)

2 [inner product -> elu]

In [28]:
W_de2 = weight_variable([4,5])
b_de2 = bias_variable([5])
de_layer2 = tf.nn.elu(tf.matmul(de_layer1, W_de2) + b_de2)

3 [inner product -> elu]

In [29]:
W_de3 = weight_variable([5,6])
b_de3 = bias_variable([6])
de_layer3 = tf.nn.elu(tf.matmul(de_layer2, W_de3) + b_de3)

4 [inner product -> elu]

In [30]:
W_de4 = weight_variable([6,6])
b_de4 = bias_variable([6])
de_layer4 = tf.nn.elu(tf.matmul(de_layer3, W_de4) + b_de4)

5 [inner product]

In [31]:
W_de5 = weight_variable([6,4])
b_de5 = bias_variable([4])
de_layer5 = tf.matmul(de_layer4, W_de5) + b_de5

Setting up the Loss and Optimizer

In [32]:
SSE = tf.reduce_sum(tf.square(tf.subtract(de_layer5,x)))
train_step = tf.train.AdamOptimizer(1e-3).minimize(SSE)
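
A small design note: tf.reduce_sum makes the loss grow with the batch size, so its value is not comparable across different batch sizes. If a batch-size-independent objective is preferred, tf.reduce_mean is the usual alternative; a sketch (not what this notebook actually runs):

MSE = tf.reduce_mean(tf.square(tf.subtract(de_layer5, x)))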
In [33]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
In [34]:
for i in range(100001):
    # Sample a mini-batch of 50 of the 100 training rows.
    batch_obs = np.random.choice(training_x.index, 50, replace=False)
    sess.run(train_step, feed_dict = {x: training_x.loc[batch_obs]})

    if i % 20000 == 0:
        # Report the reconstruction SSE on the full training set.
        train_sse = sess.run(SSE, feed_dict = {x: training_x})
        print("step " + str(i) + " training SSE " + str(train_sse))
    
step 0 training SSE 2421.058
step 20000 training SSE 15.231902
step 40000 training SSE 8.189543
step 60000 training SSE 5.4591618
step 80000 training SSE 4.884925
step 100000 training SSE 4.710042

Clustering results

training

In [35]:
result = sess.run(tf.argmax(en_layer5, 1), feed_dict = {x: training_x})
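
Since en_layer5 is a softmax over three units, each row of the code layer is a soft cluster membership, and tf.argmax simply converts it into a hard label. To inspect the soft memberships directly (the name soft_assign is my own, not from the notebook):

soft_assign = sess.run(en_layer5, feed_dict = {x: training_x})
print(soft_assign[:5].round(3))  # first five rows: probabilities over the 3 codes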
In [36]:
tempdf = pd.concat(
    [pd.DataFrame(training_y),
     pd.DataFrame(result,index=training_y.index,columns=["clu"])
    ],axis=1)
In [37]:
tempdf.groupby(by=["clu"]).variety.value_counts()
Out[37]:
clu  variety   
0    Virginica     35
     Versicolor    28
1    Setosa        33
2    Versicolor     3
     Virginica      1
Name: variety, dtype: int64
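
The same information can be read more easily as a contingency table; a one-line alternative view using pandas (equivalent content, just laid out as a matrix):

pd.crosstab(tempdf.clu, tempdf.variety)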

testing

In [38]:
result = sess.run(tf.argmax(en_layer5, 1), feed_dict = {x: testing_x})
In [39]:
tempdf = pd.concat(
    [pd.DataFrame(testing_y),
     pd.DataFrame(result,index=testing_y.index,columns=["clu"])
    ],axis=1)
In [40]:
tempdf.groupby(by=["clu"]).variety.value_counts()
Out[40]:
clu  variety   
0    Versicolor    19
     Virginica     12
1    Setosa        17
2    Virginica      2
Name: variety, dtype: int64

These results naturally raise a couple of questions.

  1. The clustering did not work well: the model seems to have learned to pack most of the iris characteristics into the first code unit, with only finer details showing up in the remaining second and third units.

  2. Using an autoencoder for clustering seems to be a stretch.

