
Clustering using Autoencoder (ver.Python)

Iris

In [1]:
import pandas as pd
In [2]:
iris = pd.read_csv("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv")
In [3]:
iris.head()
Out[3]:
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
In [4]:
iris.shape
Out[4]:
(150, 5)
In [5]:
iris.describe()
Out[5]:
sepal.length sepal.width petal.length petal.width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
In [6]:
iris.variety.value_counts()
Out[6]:
Versicolor    50
Setosa        50
Virginica     50
Name: variety, dtype: int64

Manipulating the data set

In [7]:
import numpy as np
In [8]:
np.random.seed(1)
test_obs = np.random.choice(150,50,replace=False)
In [9]:
training_set = iris.loc[sorted(set(iris.index) - set(test_obs))]
testing_set = iris.loc[test_obs]
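
As a quick sanity check (my addition, not in the original notebook), the two subsets should partition the 150 rows exactly:

assert len(training_set) + len(testing_set) == len(iris)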
In [10]:
training_y = training_set.variety
training_x = training_set.drop("variety",axis=1)
In [11]:
testing_y = testing_set.variety
testing_x = testing_set.drop("variety",axis=1)
In [12]:
training_y.value_counts()
Out[12]:
Virginica     36
Setosa        33
Versicolor    31
Name: variety, dtype: int64
In [13]:
testing_y.value_counts()
Out[13]:
Versicolor    19
Setosa        17
Virginica     14
Name: variety, dtype: int64
In [14]:
std_vec = training_x.std()
mean_vec = training_x.mean()
In [15]:
std_vec
Out[15]:
sepal.length    0.832159
sepal.width     0.414888
petal.length    1.762407
petal.width     0.777899
dtype: float64
In [16]:
mean_vec
Out[16]:
sepal.length    5.806
sepal.width     3.033
petal.length    3.772
petal.width     1.205
dtype: float64
In [17]:
training_x = (training_x - mean_vec) / std_vec
testing_x = (testing_x - mean_vec) / std_vec
In [18]:
training_x.shape
Out[18]:
(100, 4)
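
Both sets are standardized with the training mean and standard deviation, so no information from the test set leaks into the scaling. As a quick check (my addition), the standardized training features should now have mean ≈ 0 and standard deviation ≈ 1:

print(training_x.mean().round(3))
print(training_x.std().round(3))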

A detailed explanation of the model is omitted here.

Autoencoder, elu, softmax, Adam

  • encoding : [inner product -> elu] -> [inner product -> elu] -> [inner product -> elu] -> [inner product -> elu] -> [inner product -> softmax]

  • decoding : [inner product -> elu] -> [inner product -> elu] -> [inner product -> elu] -> [inner product -> elu] -> [inner product]

  • Loss : squared error loss (written out below this list)

  • Optimizer : Adam
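
For reference, the squared error loss can be written out explicitly. With $x_i$ the $i$-th standardized input row and $\hat{x}_i$ the decoder's reconstruction of it, the objective minimized below is

$$\mathrm{SSE} \;=\; \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2 \;=\; \sum_{i=1}^{n} \sum_{j=1}^{4} \left( x_{ij} - \hat{x}_{ij} \right)^2,$$

which is exactly what tf.reduce_sum(tf.square(tf.subtract(de_layer5, x))) computes in the code further down.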

In [19]:
import tensorflow as tf
In [20]:
x = tf.placeholder("float", [None, 4])

Function definition: weight_variable - draws random values from a truncated normal distribution to create a weight tensor of the desired shape.

Function definition: bias_variable - creates a bias tensor of the desired shape (also initialized from a truncated normal here).

In [21]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.truncated_normal(shape)
    return tf.Variable(initial)
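
One thing to note (my observation, not part of the original notebook): tf.truncated_normal defaults to stddev=1.0, which is a fairly large initialization scale for layers this small. A minimal sketch of the common smaller-scale alternative:

def weight_variable_small(shape):
    # Same truncated normal init, but with stddev=0.1 instead of the default 1.0.
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)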

Model setup

Encoder

1 [inner product -> elu]

In [22]:
W_en1 = weight_variable([4,6])
b_en1 = bias_variable([6])
en_layer1 = tf.nn.elu(tf.matmul(x, W_en1) + b_en1)

2 [inner product -> elu]

In [23]:
W_en2 = weight_variable([6,6])
b_en2 = bias_variable([6])
en_layer2 = tf.nn.elu(tf.matmul(en_layer1, W_en2) + b_en2)

3 [inner product -> elu]

In [24]:
W_en3 = weight_variable([6,5])
b_en3 = bias_variable([5])
en_layer3 = tf.nn.elu(tf.matmul(en_layer2, W_en3) + b_en3)

4 [inner product -> elu]

In [25]:
W_en4 = weight_variable([5,4])
b_en4 = bias_variable([4])
en_layer4 = tf.nn.elu(tf.matmul(en_layer3, W_en4) + b_en4)

5 [inner product -> softmax]

In [26]:
W_en5 = weight_variable([4,3])
b_en5 = bias_variable([3])
en_layer5 = tf.nn.softmax(tf.matmul(en_layer4, W_en5) + b_en5)

Decoder

1 [inner product -> elu]

In [27]:
W_de1 = weight_variable([3,4])
b_de1 = bias_variable([4])
de_layer1 = tf.nn.elu(tf.matmul(en_layer5, W_de1) + b_de1)

2 [inner product -> elu]

In [28]:
W_de2 = weight_variable([4,5])
b_de2 = bias_variable([5])
de_layer2 = tf.nn.elu(tf.matmul(de_layer1, W_de2) + b_de2)

3 [inner product -> elu]

In [29]:
W_de3 = weight_variable([5,6])
b_de3 = bias_variable([6])
de_layer3 = tf.nn.elu(tf.matmul(de_layer2, W_de3) + b_de3)

4 [inner product -> elu]

In [30]:
W_de4 = weight_variable([6,6])
b_de4 = bias_variable([6])
de_layer4 = tf.nn.elu(tf.matmul(de_layer3, W_de4) + b_de4)

5 [inner product]

In [31]:
W_de5 = weight_variable([6,4])
b_de5 = bias_variable([4])
de_layer5 = tf.matmul(de_layer4, W_de5) + b_de5

Setting up the Loss and Optimizer

In [32]:
SSE = tf.reduce_sum(tf.square(tf.subtract(de_layer5,x)))
train_step = tf.train.AdamOptimizer(1e-3).minimize(SSE)
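
A small design note: tf.reduce_sum makes the loss grow with the batch size, so its value is not comparable across different batch sizes. If a batch-size-independent objective is preferred, tf.reduce_mean is the usual alternative; a sketch (not what this notebook actually runs):

MSE = tf.reduce_mean(tf.square(tf.subtract(de_layer5, x)))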
In [33]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
In [34]:
for i in range(100001):
    # Sample a mini-batch of 50 of the 100 training rows.
    batch_obs = np.random.choice(training_x.index, 50, replace=False)
    sess.run(train_step, feed_dict = {x: training_x.loc[batch_obs]})

    if i % 20000 == 0:
        # Report the reconstruction SSE on the full training set.
        train_sse = sess.run(SSE, feed_dict = {x: training_x})
        print("step " + str(i) + " training SSE " + str(train_sse))
    
step 0 training SSE 2421.058
step 20000 training SSE 15.231902
step 40000 training SSE 8.189543
step 60000 training SSE 5.4591618
step 80000 training SSE 4.884925
step 100000 training SSE 4.710042

Clustering results

training

In [35]:
result = sess.run(tf.argmax(en_layer5, 1), feed_dict = {x: training_x})
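
Since en_layer5 is a softmax over three units, each row of the code layer is a soft cluster membership, and tf.argmax simply converts it into a hard label. To inspect the soft memberships directly (the name soft_assign is my own, not from the notebook):

soft_assign = sess.run(en_layer5, feed_dict = {x: training_x})
print(soft_assign[:5].round(3))  # first five rows: probabilities over the 3 codes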
In [36]:
tempdf = pd.concat(
    [pd.DataFrame(training_y),
     pd.DataFrame(result,index=training_y.index,columns=["clu"])
    ],axis=1)
In [37]:
tempdf.groupby(by=["clu"]).variety.value_counts()
Out[37]:
clu  variety   
0    Virginica     35
     Versicolor    28
1    Setosa        33
2    Versicolor     3
     Virginica      1
Name: variety, dtype: int64
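
The same information can be read more easily as a contingency table; a one-line alternative view using pandas (equivalent content, just laid out as a matrix):

pd.crosstab(tempdf.clu, tempdf.variety)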

testing

In [38]:
result = sess.run(tf.argmax(en_layer5, 1), feed_dict = {x: testing_x})
In [39]:
tempdf = pd.concat(
    [pd.DataFrame(testing_y),
     pd.DataFrame(result,index=testing_y.index,columns=["clu"])
    ],axis=1)
In [40]:
tempdf.groupby(by=["clu"]).variety.value_counts()
Out[40]:
clu  variety   
0    Versicolor    19
     Virginica     12
1    Setosa        17
2    Virginica      2
Name: variety, dtype: int64

These results naturally raise a couple of questions.

  1. The clustering did not work well: the model seems to have learned to pack most of the iris characteristics into the first code unit, with only finer details showing up in the remaining second and third units.

  2. Using an autoencoder for clustering seems to be a stretch.

