[TIL] 21.07.29 Embeddings

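Partway through the course my virtual environment got tangled up, so although I had been following the class in Jupyter Notebook, I had no choice but to switch to Colab.

The later lessons work with larger datasets, though, and I felt I couldn't put it off any longer, so today I set up a fresh virtual environment and went back to following the class in Jupyter Notebook.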

import os

imdb_dir = 'C:/MLwork/aclImdb/aclImdb'

train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
  dir_name = os.path.join(train_dir, label_type)
  print(dir_name)
  for fname in os.listdir(dir_name):
    if fname.endswith('.txt'):  # only read .txt files
      # the with block closes the file even if read() fails
      with open(os.path.join(dir_name, fname), encoding='utf8') as f:
        texts.append(f.read())  # store the review text in order
      # append 0 for negative and 1 for positive at the same index as in texts
      if label_type == 'neg':
        labels.append(0)
      else:
        labels.append(1)

# neg files are read first, so index 0 is negative and index 12500 is the first positive review
print(texts[0])
print(labels[0])
print(texts[12500])
print(labels[12500])
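
The loop above derives each label purely from the folder name. Here is a minimal sketch of the same idea as a reusable function, run against a throwaway directory instead of the real dataset. The function name `load_labeled_texts` and the temporary files are hypothetical, made up for illustration.

```python
import tempfile
from pathlib import Path

def load_labeled_texts(root, label_dirs=("neg", "pos")):
    """Read every .txt file under root/<label_dir>; the label is the folder's position in label_dirs."""
    texts, labels = [], []
    for label, name in enumerate(label_dirs):
        # sorted() makes the order reproducible; directory listing order is arbitrary
        for path in sorted((Path(root) / name).glob("*.txt")):
            texts.append(path.read_text(encoding="utf8"))
            labels.append(label)
    return texts, labels

# Tiny fake dataset (hypothetical files) with the same neg/pos layout as aclImdb/train
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("neg", "awful movie"), ("pos", "great movie")]:
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / "0_1.txt").write_text(body, encoding="utf8")
        (Path(tmp) / name / "notes.md").write_text("ignored", encoding="utf8")
    demo_texts, demo_labels = load_labeled_texts(tmp)

print(demo_texts, demo_labels)
```

Because texts and labels are appended in the same pass, they stay index-aligned, which is what the later shuffle relies on.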

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100  # keep at most 100 words per review (by default pad_sequences drops the earliest words of longer reviews)
training_samples = 200  # train on 200 samples
validation_samples = 10000  # validate on 10,000 samples
max_words = 10000  # only consider the 10,000 most frequent words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)  # build the word index

# convert each text into a list of the word indices built above
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index  # dict of every word seen, not just the top max_words
print('Found %s unique tokens.' % len(word_index))

# padding: make every sequence the same length
data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
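
To see what the tokenizing and padding steps do to a single review, here is a rough pure-Python imitation. It is a simplified sketch, not the Keras implementation: it only splits on whitespace (the real Tokenizer also strips punctuation), and the helper names `build_word_index`, `to_sequence`, and `pad_pre` are made up for this example.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent (0 is reserved for padding)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only words whose index is below num_words, like Tokenizer(num_words=...)."""
    ids = (word_index.get(w) for w in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

def pad_pre(seq, maxlen):
    """Truncate from the front, then left-pad with zeros (the pad_sequences defaults)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_texts = ["good good good movie movie film"]
toy_index = build_word_index(toy_texts)
toy_seq = to_sequence("good film movie", toy_index, num_words=3)
print(toy_index, toy_seq, pad_pre(toy_seq, 4))
```

Two details are easy to miss: a word outside the top-ranked vocabulary is silently dropped rather than replaced, and a review longer than maxlen keeps its last words, not its first.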

# Split the data into a training set and a validation set
# The samples are ordered (all negatives first, then all positives), so shuffle them first
# Create an array of indices covering every row of data
# data.shape[0] is 25000
indices = np.arange(data.shape[0])
np.random.shuffle(indices)    # shuffle the 25,000 index numbers, mixing positives and negatives
data = data[indices]          # reorder data by the shuffled indices
labels = labels[indices]      # reorder labels by the same indices

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
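
The key point of the shuffle is that data and labels are reordered by the same index array, so each review keeps its own label. A small sketch on toy arrays follows. The seeded generator is my own addition, not part of the class code: it makes the split repeatable between runs, and the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for this demo

toy_data = np.arange(10).reshape(5, 2)   # row i is [2i, 2i+1]
toy_labels = toy_data[:, 0] // 2         # label i belongs to row i

perm = rng.permutation(len(toy_data))    # one shared order for both arrays
shuffled_data = toy_data[perm]
shuffled_labels = toy_labels[perm]

# Every row should still sit next to its own label
still_aligned = np.array_equal(shuffled_data[:, 0] // 2, shuffled_labels)
print(perm, still_aligned)
```

With only 200 training samples, which reviews land in the training slice can change the results from run to run, so fixing the seed makes comparisons easier.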

glove_dir = r'C:\MLwork\glove.6B'  # raw string so the backslashes are not read as escape sequences

embeddings_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]  # the word itself
        # everything from index 1 to the end is the 100-dimensional embedding vector
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

embedding_dim = 100

# rows start at zero; words missing from embeddings_index keep their all-zero row
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    # embeddings_index is a dict mapping each word to its 100-dimensional vector
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
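
The same lookup can be tried on a tiny scale without the GloVe download. The sketch below uses three made-up 3-dimensional vectors in GloVe's text layout and a hypothetical word index, and it also counts how many rows actually received a vector, which is a quick way to check vocabulary coverage. All names prefixed with `toy_` or `fake_` are invented for illustration.

```python
import io
import numpy as np

# Three made-up vectors in GloVe's text layout: the word, then its floats
fake_glove = io.StringIO("good 0.1 0.2 0.3\nmovie 0.4 0.5 0.6\nbad -0.1 -0.2 -0.3\n")
toy_embeddings = {}
for line in fake_glove:
    word, *coefs = line.split()
    toy_embeddings[word] = np.asarray(coefs, dtype="float32")

toy_word_index = {"good": 1, "movie": 2, "zzyzx": 3, "bad": 4}  # hypothetical tokenizer output
toy_max_words, toy_dim = 4, 3

toy_matrix = np.zeros((toy_max_words, toy_dim))
hits = 0
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    # skip indices beyond the vocabulary limit and words with no pre-trained vector
    if i < toy_max_words and vector is not None:
        toy_matrix[i] = vector
        hits += 1

print(f"{hits} of {toy_max_words - 1} usable rows filled")
print(toy_matrix)
```

Row 0 stays zero because index 0 is the padding value, "zzyzx" stays zero because it has no vector, and "bad" is skipped because its index falls outside the vocabulary limit.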

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding,Flatten,Dense

model= Sequential()
model.add(Embedding(max_words,embedding_dim,input_length=maxlen))
model.add(Flatten())
model.add(Dense(32,activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.summary()
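
The parameter counts that model.summary() reports can be worked out by hand from the layer sizes. A short arithmetic sketch, using the values set earlier in this post:

```python
max_words, embedding_dim, maxlen = 10000, 100, 100  # values from the code above

embedding_params = max_words * embedding_dim   # one 100-d vector per word index
flatten_size = maxlen * embedding_dim          # (100, 100) output flattened to 10000
dense1_params = flatten_size * 32 + 32         # weights plus biases of Dense(32)
dense2_params = 32 * 1 + 1                     # weights plus bias of Dense(1)

total = embedding_params + dense1_params + dense2_params
print(embedding_params, dense1_params, dense2_params, total)
```

The Embedding layer accounts for 1,000,000 of the 1,320,065 parameters, so once it is frozen only the 320,065 parameters of the two Dense layers are left to train.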

# The pre-trained word embeddings are already loaded, so the Embedding layer must not be updated during training

model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable=False

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history=model.fit(x_train,y_train,epochs=10,batch_size=32,validation_data=(x_val,y_val))
model.save_weights('pre_trained_glove_model.h5')

import matplotlib.pyplot as plt
acc=history.history['acc']
val_acc=history.history['val_acc']
loss=history.history['loss']
val_loss=history.history['val_loss']

epochs=range(1,len(acc)+1)

plt.plot(epochs,acc,'bo',label='Training acc')
plt.plot(epochs,val_acc,'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()

plt.plot(epochs,loss,'bo',label='Training loss')
plt.plot(epochs, val_loss,'b',label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

Joshuamogy
