[Imple] Language Model (Encoder-Decoder)


_leezoee_ 2023. 4. 21. 10:39

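Definition

To generate language, the output of the previous time step is fed as the input to the next time step. At each step the model either selects the most probable next word (greedy selection) or samples from the predicted probability distribution.

Language models are applied to machine translation, question answering, chatbots, speech recognition, text summarization, image captioning, and more.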
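The generation loop from the definition can be sketched in a few lines. This is only a toy: `NEXT_WORD_PROBS` is a hypothetical lookup table standing in for a trained model, and the probabilities are made up for illustration.

```python
# Hypothetical next-word probabilities standing in for a trained model.
NEXT_WORD_PROBS = {
    "<sos>": {"the": 0.7, "a": 0.3},
    "the": {"weather": 0.6, "sky": 0.4},
    "weather": {"is": 0.9, "was": 0.1},
    "is": {"good": 0.8, "bad": 0.2},
    "good": {"<eos>": 1.0},
}


def generate_greedy(max_steps=10):
    """Feed each chosen word back in as the next input until <eos>."""
    word = "<sos>"
    sentence = []
    for _ in range(max_steps):
        probs = NEXT_WORD_PROBS[word]
        word = max(probs, key=probs.get)  # greedy: take the most probable word
        if word == "<eos>":
            break
        sentence.append(word)
    return sentence


print(generate_greedy())
```

Replacing the `max(...)` line with a random draw weighted by `probs` would turn this into the sampling variant.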

Seq2Seq Language Translation (Machine Translation)

= Encoder-decoder model

For a machine translation task, the same model architecture can be applied regardless of the language pair.

The encoder-decoder model is an important technique with a very wide range of uses.

It can also be applied to question answering (story + question => answer).

"Seq2Seq" expresses the idea of converting one sequence into another sequence.

story + question : input sequence => thought vector encoding

answer : output sequence => thought vector decoding

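BLEU (Bilingual Evaluation Understudy)

An algorithm for evaluating machine translation quality. Its advantages are that it is language-independent, easy to understand, and easy to compute.

The score lies in [0, 1].

The score is computed from n-gram overlap (unigrams, bigrams, and so on) between the prediction and the reference. The full BLEU score combines the precisions of several n-gram orders and applies a brevity penalty; the example here is simplified to bigram precision only.

Example:

Reference : "the weather is extremely good"

(the, weather), (weather, is), (is, extremely), (extremely, good)

Prediction : "the weather is good"

(the, weather), (weather, is), (is, good)

BLEU = [is (the, weather) in the reference?] + [is (weather, is) in the reference?] + [is (is, good) in the reference?] = 1/3 + 1/3 + 0/3 ≈ 0.667

That is roughly how the score is calculated.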
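The bigram calculation above can be reproduced with a short sketch. It covers only the simplified bigram precision from the example, not the full BLEU score (no clipping of repeated n-grams, no brevity penalty, no combination of n-gram orders); the function names are my own.

```python
def bigrams(sentence):
    """Split on whitespace and return the list of adjacent word pairs."""
    tokens = sentence.split()
    return list(zip(tokens, tokens[1:]))


def bigram_precision(reference, candidate):
    """Fraction of candidate bigrams that also appear in the reference."""
    reference_bigrams = set(bigrams(reference))
    candidate_bigrams = bigrams(candidate)
    if not candidate_bigrams:
        return 0.0
    matches = sum(1 for bg in candidate_bigrams if bg in reference_bigrams)
    return matches / len(candidate_bigrams)


score = bigram_precision("the weather is extremely good", "the weather is good")
print(round(score, 3))
```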

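Teacher Forcing

If the model predicts the first word wrongly, the errors tend to grow at later time steps, because each wrong output becomes the next input. Teacher forcing is the technique for avoiding this during training.

During training the correct answer is already known and the model learns from the difference between its prediction and the answer, so the actual (ground-truth) value is used as the input for predicting the next word.

Getting a whole sentence right at once is hard, so the sentence is completed word by word, as if a teacher were correcting each step. => This is done only in the training step. Once training is finished and the model is actually used (production), teacher forcing must not be used (generation has to be autoregressive).

In other words, an encoder-decoder model is built twice: once as a training model and once as a production model.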
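To make the difference concrete, here is a small sketch of what the decoder receives as input at each step in the two modes. The sentences and the wrong first prediction are made-up illustration data, and `decoder_inputs` is a hypothetical helper, not part of Keras.

```python
def decoder_inputs(reference, predictions, teacher_forcing):
    """Return the token fed to the decoder at each time step.

    With teacher forcing, step t receives the ground-truth word from step t-1.
    Without it (autoregressive), step t receives the model's own previous output.
    """
    previous = reference if teacher_forcing else predictions
    return ["<sos>"] + previous[:-1]


reference = ["the", "weather", "is", "good"]
predictions = ["a", "weather", "is", "good"]  # hypothetical: first word is wrong

print(decoder_inputs(reference, predictions, teacher_forcing=True))
print(decoder_inputs(reference, predictions, teacher_forcing=False))
```

In the autoregressive case the wrong first word is fed back in at step 2, which is how an early mistake can propagate.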

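Encoder-Decoder Model

Two implementations are needed: a training model and an inference model.

The training model uses teacher forcing, and the inference model uses text generation.

Diagram: Encoder + Decoder in the training model

To compare the final decoder output with the answer and backpropagate the error, the network has to be connected.

So the combined model is used as the training model.

Diagram: Encoder and Decoder separated in the inference (prediction) model

No backpropagation is needed, so the networks do not have to be joined, and the decoder is used as a separate model.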

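Decoding Strategy

These are the strategies that decide which word to pick at each decoding step.

1. Greedy strategy

Pick the highest probability (argmax) from the softmax distribution => the same result every time

2. Sampling strategy (posterior probability distribution)

Random sampling according to the distribution, so the translation can change every time

np.random.choice(len(probs), p=probs)

3. Beam-search strategy

If the first word is simply picked by argmax, a single grammatical mistake at any step can turn into a big mistake in the whole translation. Beam search therefore keeps b candidate sequences at each time step t and ranks them by joint probability.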
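The three strategies can be compared on a toy example. `TABLE` below is a hypothetical set of next-token distributions (not the output of the model in this post), chosen so that greedy decoding and beam search disagree.

```python
import math
import random

# Hypothetical next-token distributions, keyed by the previous token.
TABLE = {
    "<sos>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
    "x": {"<eos>": 1.0},
    "y": {"<eos>": 1.0},
    "z": {"<eos>": 1.0},
}


def greedy_step(probs):
    """Strategy 1: always take the most probable token."""
    return max(probs, key=probs.get)


def sample_step(probs, rng):
    """Strategy 2: draw a token according to the distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


def beam_search(beam_width, max_steps=5):
    """Strategy 3: keep the beam_width best sequences by joint log-probability."""
    beams = [(0.0, ["<sos>"])]  # (log joint probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "<eos>":  # finished sequences are carried over as-is
                candidates.append((score, seq))
                continue
            for token, p in TABLE[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == "<eos>" for _, seq in beams):
            break
    return beams[0][1]


print(beam_search(beam_width=1))  # behaves like greedy decoding
print(beam_search(beam_width=2))
print(sample_step(TABLE["<sos>"], random.Random(0)))
```

With a beam width of 1 the search commits to "a" (0.6) and ends with joint probability 0.3, while a width of 2 keeps "b" alive and finds "b z" with joint probability 0.36.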

Writing the code (English -> Korean translation)

Data

http://www.manythings.org/anki/

Tab-delimited Bilingual Sentence Pairs from the Tatoeba Project (Good for Anki and Similar Flashcard Applications)


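The code was written and run on Google Colab with a GPU! (There is not much data, so I did not split it into training and validation sets during preprocessing. I just set validation_split when training the model.)

Import the libraries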

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import plot_model

Set the parameters

BATCH_SIZE = 64
NUM_SAMPLES = 10000
MAX_VOCAB_SIZE = 10000
EMBEDDING_DIM = 100
LATENT_DIM = 512

Creating the input data

1. Input text (English data)

2. Input and target data for teacher forcing (Korean data, decoder side) : just shift by one token

ex)

input : ['<sos>', '어제는', '좋은', '날이었다.']

Teacher forcing target : ['어제는', '좋은', '날이었다.', '<eos>']
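The one-token shift in the example can be written as a tiny helper. `make_teacher_forcing_pair` is a name I made up for this sketch.

```python
def make_teacher_forcing_pair(tokens):
    """Build the decoder input and target from one tokenized sentence.

    The input starts with <sos>, the target ends with <eos>, so the target
    at position t is the word the decoder should predict after seeing
    the input at position t.
    """
    decoder_input = ["<sos>"] + tokens
    decoder_target = tokens + ["<eos>"]
    return decoder_input, decoder_target


decoder_input, decoder_target = make_teacher_forcing_pair(["어제는", "좋은", "날이었다."])
print(decoder_input)
print(decoder_target)
```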

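Prepare the data file in advance (on a mounted Google Drive or on GitHub) and load it through the file_path variable,

then build the teacher forcing input and target data.

# load in the data
eng_texts = []
kor_inputs = []
kor_targets = []

# load data (the with-block closes the file afterwards)
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        if '\t' not in line:  # skip lines without a tab
            continue
        # split into input and target translation; extra columns (e.g. attribution) are ignored
        english, korean = line.rstrip('\n').split('\t')[:2]
        # build the teacher forcing input and target
        # the space keeps <sos> and <eos> as separate tokens when splitting on whitespace
        kor_input = '<sos> ' + korean
        kor_target = korean + ' <eos>'

        eng_texts.append(english)
        kor_inputs.append(kor_input)
        kor_targets.append(kor_target)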
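Since the files are tab-delimited and may or may not carry a third attribution column, the line parsing can be isolated in a small helper. This is a sketch under that assumption; `parse_pair` is a hypothetical name and the sample lines are placeholders, not real rows from the dataset.

```python
def parse_pair(line):
    """Return (english, korean) from one tab-delimited line, or None if malformed.

    Only the first two columns are used, so a trailing attribution
    column is tolerated but not required.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2:
        return None
    return parts[0], parts[1]


print(parse_pair("Hi.\t안녕.\tplaceholder attribution\n"))
print(parse_pair("a line without tabs\n"))
```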

Next, tokenization

Each language has to be tokenized separately.

Tokenizing the English input text

tokenizer_eng = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer_eng.fit_on_texts(eng_texts)

# check that the word index was built correctly
tokenizer_eng.word_index

# convert the texts to integer sequences
eng_sequences = tokenizer_eng.texts_to_sequences(eng_texts)

Checking the tokenization result

word2idx_eng = tokenizer_eng.word_index
print(f'unique input tokens : {len(word2idx_eng)}')

# +1 because index 0 is reserved for padding
num_words_eng = min(MAX_VOCAB_SIZE, len(word2idx_eng) + 1)
print("Number of words in the input text :", num_words_eng)

max_len_eng = max(len(s) for s in eng_sequences)
print("Maximum length of the input text :", max_len_eng)

Tokenizing the Korean input text

Unlike English, filters="" has to be passed as a parameter.

# special tokens such as <sos> and <eos> lose their brackets under the default filters,
# so pass filters="" to keep them intact
tokenizer_kor = Tokenizer(num_words=MAX_VOCAB_SIZE, filters="")
tokenizer_kor.fit_on_texts(kor_inputs + kor_targets)

# convert the texts to integer sequences
kor_input_sequences = tokenizer_kor.texts_to_sequences(kor_inputs)
kor_target_sequences = tokenizer_kor.texts_to_sequences(kor_targets)

print(kor_input_sequences[1500])
print(kor_target_sequences[1500])

print([tokenizer_kor.index_word[idx] for idx in kor_input_sequences[1500]])
print([tokenizer_kor.index_word[idx] for idx in kor_target_sequences[1500]])
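Why filters="" matters can be seen with a pure-Python imitation of the tokenizer's text splitting. This is not the Keras implementation, only a rough stand-in that assumes the documented default filter string and lowercasing; `to_word_sequence` is a hypothetical name.

```python
# Default filter characters of the Keras Tokenizer (assumed from its documentation).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def to_word_sequence(text, filters=DEFAULT_FILTERS):
    """Lowercase, replace every filter character with a space, split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()


sentence = "<sos> 어제는 좋은 날이었다."
print(to_word_sequence(sentence))              # brackets and the final period are stripped
print(to_word_sequence(sentence, filters=""))  # <sos> survives as its own token
```

With the default filters the start token degrades to plain "sos", so a later lookup of '<sos>' in the word index would fail.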

Checking the tokenization result

word2idx_kor = tokenizer_kor.word_index
print(f'unique output tokens : {len(word2idx_kor)}')

# +1 because index 0 is reserved for padding
num_words_kor = len(word2idx_kor) + 1
print("Number of words in the target language :", num_words_kor)

max_len_kor = max(len(s) for s in kor_target_sequences)
print("Maximum length in the target language :", max_len_kor)

Sequence padding

Previously I padded at the end with post, but the encoder's purpose is to produce the thought vector, so it is padded with pre (the default). That way the real tokens come last, right before the final state is taken.

The decoder has to do teacher forcing, so it is padded with post.

encoder_inputs = pad_sequences(eng_sequences, maxlen=max_len_eng)  # padding defaults to 'pre' when not set
print("encoder input shape :", encoder_inputs.shape)
print("encoder_inputs[1500] : ", encoder_inputs[1500])

decoder_inputs = pad_sequences(kor_input_sequences, maxlen=max_len_kor, padding='post')
print("\ndecoder input shape :", decoder_inputs.shape)
print("decoder_inputs[1500] : ", decoder_inputs[1500])

decoder_targets = pad_sequences(kor_target_sequences, maxlen=max_len_kor, padding='post')
print("\ndecoder target shape :", decoder_targets.shape)
print("decoder_targets[1500] : ", decoder_targets[1500])
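The difference between pre and post padding is easy to see on a single sequence. The helper below is a pure-Python imitation for illustration, not the Keras function; it assumes the default behaviour of truncating from the front when a sequence is too long.

```python
def pad(sequence, maxlen, padding="pre"):
    """Pad one integer sequence with zeros to length maxlen."""
    sequence = sequence[-maxlen:]  # keep the last maxlen items if too long
    zeros = [0] * (maxlen - len(sequence))
    return zeros + sequence if padding == "pre" else sequence + zeros


print(pad([5, 3, 8], maxlen=5))                  # encoder style: zeros first
print(pad([5, 3, 8], maxlen=5, padding="post"))  # decoder style: zeros last
```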

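Next, pre-trained word embedding values are brought in through transfer learning.

1) Initialize the weights of the Embedding layer from a pre-trained model.

2) Training starts from values learned on previously prepared data, so quality improves.

3) This is called transfer learning. Once the embedding concept was introduced to language models, transfer learning became possible for them.

Using pre-trained data (Stanford open data; I could only find English, nothing for Korean...)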

https://www.kaggle.com/datasets/danielwillgeorge/glove6b100dtxt

glove.6B.100d.txt

Stanford's GloVe 100d word embeddings


def make_embedding(num_words, embedding_dim, tokenizer, max_vocab_size):
    embeddings_dict = {}
    # the file was placed on the mounted Google Drive beforehand
    glove_path = './glove.6B.100d.txt'

    with open(glove_path, encoding="utf8") as f:
        for line in f:
            values = line.split()  # each line is a word followed by its vector
            word = values[0]  # the first value is the word
            # the remaining 100 values are the embedding vector of that word
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_dict[word] = coefs

    embedding_matrix = np.zeros((num_words, embedding_dim))  # initialize with zeros

    print("number of words =", num_words)
    print(embedding_matrix.shape)

    for word, i in tokenizer.word_index.items():
        # i must also stay below num_words, the number of rows in the matrix
        if i < min(num_words, max_vocab_size):
            embedding_vector = embeddings_dict.get(word)
            if embedding_vector is not None:  # words missing from GloVe stay all-zero
                embedding_matrix[i] = embedding_vector

    return embedding_matrix
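The same idea on toy data, without NumPy or the real GloVe file, may make the row layout clearer. The three-dimensional vectors and the word index below are made up for illustration, and both function names are hypothetical.

```python
def parse_glove_line(line):
    """Split one 'word v1 v2 ...' line into the word and its float vector."""
    values = line.split()
    return values[0], [float(v) for v in values[1:]]


def build_matrix(word_index, vectors, num_words, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(num_words)]
    for word, i in word_index.items():
        if i < num_words and word in vectors:
            matrix[i] = vectors[word]
    return matrix


# Made-up 3-dimensional "pre-trained" vectors in GloVe text format.
lines = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
vectors = dict(parse_glove_line(line) for line in lines)

word_index = {"the": 1, "cat": 2, "dog": 3}  # index 0 is reserved for padding
for row in build_matrix(word_index, vectors, num_words=4, dim=3):
    print(row)
```

Row 0 (padding) and row 3 ("dog", which has no pre-trained vector) stay all-zero, matching the behaviour of the function above.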

Building the embedding layer

# call the function defined above
embedding_matrix = make_embedding(num_words_eng, EMBEDDING_DIM, tokenizer_eng, MAX_VOCAB_SIZE)
# Keras Embedding: weights sets the initial values to embedding_matrix, trainable=True keeps updating them
embedding_layer = Embedding(num_words_eng, EMBEDDING_DIM, weights=[embedding_matrix], trainable=True)

Building the encoder model

# Encoder
encoder_inputs_ = Input(shape=(max_len_eng, ), name='Encoder_Input')

# use the pre-trained embedding layer
x = embedding_layer(encoder_inputs_)  # encoder_inputs_ is the input
encoder_outputs, h, c = LSTM(LATENT_DIM, return_state=True)(x)  # x is the input

# the encoder passes only the hidden state and cell state to the decoder --> thought vector
encoder_states = [h, c]
encoder_model = Model(inputs=encoder_inputs_, outputs=encoder_states)
encoder_model.summary()

Building the decoder model

# the decoder uses [h, c] as its initial state
decoder_inputs_ = Input(shape=(max_len_kor,), name="Decoder_input")

# the decoder word embedding does not use pre-trained vectors
decoder_embedding = Embedding(num_words_kor, EMBEDDING_DIM)
decoder_inputs_x = decoder_embedding(decoder_inputs_)

# decoder for teacher forcing
decoder_lstm = LSTM(LATENT_DIM, return_sequences=True, return_state=True)

# initial state = encoder [h, c]
decoder_outputs, _, _ = decoder_lstm(decoder_inputs_x, initial_state=encoder_states)

# final layer
decoder_dense = Dense(num_words_kor, activation='softmax', name="Decoder_Output")
decoder_outputs = decoder_dense(decoder_outputs)

# create the teacher forcing model, connecting encoder and decoder into one network
model_teacher_forcing = Model(inputs=[encoder_inputs_, decoder_inputs_], outputs=decoder_outputs)

# the targets are integers, not one-hot vectors, so use sparse_categorical_crossentropy as the loss
model_teacher_forcing.compile(loss='sparse_categorical_crossentropy', optimizer=RMSprop(0.001), metrics=['accuracy'])

# check the compiled model
model_teacher_forcing.summary()
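The reason integer targets work with this loss is that the sparse version looks up the probability of the correct class directly, which gives the same number as the one-hot version. A hand-computed sketch with made-up probabilities (the function names are my own, not Keras API):

```python
import math


def sparse_cce(probs, target_index):
    """Cross-entropy when the target is given as an integer class index."""
    return -math.log(probs[target_index])


def one_hot_cce(probs, one_hot):
    """Cross-entropy when the target is given as a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


probs = [0.1, 0.7, 0.2]  # hypothetical softmax output over 3 words
print(sparse_cce(probs, 1))
print(one_hot_cce(probs, [0, 1, 0]))
```

Skipping the one-hot encoding saves memory, since the one-hot targets would need one value per vocabulary word at every time step.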

Once that is done, visualize the teacher forcing model

plot_model(model_teacher_forcing)

To see the shapes in detail as well

plot_model(model_teacher_forcing, show_shapes=True)

After checking this, train the teacher forcing model

history = model_teacher_forcing.fit([encoder_inputs, decoder_inputs], decoder_targets,
                                    batch_size=BATCH_SIZE, epochs=30, validation_split=0.2)

Once training is finished, plot the loss and accuracy

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,4))
ax1.plot(history.history['loss'], label='Loss')
ax1.plot(history.history['val_loss'], label='val_loss')
ax1.legend()

ax2.plot(history.history['accuracy'], label='accuracy')
ax2.plot(history.history['val_accuracy'], label='val_accuracy')
ax2.legend()

plt.show()

After that, save the trained model.

model_teacher_forcing.save('my_model.h5')

That completes the training model. To actually use it, a separate decoder model has to be written.

The inference model reuses all of the weights learned above,

with the encoder and decoder built separately

# Decoder for inference
decoder_state_input_h = Input(shape=(LATENT_DIM, ), name='Decoder_hidden_h')
decoder_state_input_c = Input(shape=(LATENT_DIM, ), name='Decoder_hidden_c')
decoder_state_inputs = [decoder_state_input_h, decoder_state_input_c]

# one token per step at inference time
decoder_inputs_single = Input(shape=(1, ), name='Decoder_input')
x = decoder_embedding(decoder_inputs_single)

# keep the output and the hidden states
decoder_outputs, h, c = decoder_lstm(x, initial_state=decoder_state_inputs)
decoder_states = [h, c]

decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(inputs=[decoder_inputs_single] + decoder_state_inputs,
                      outputs=[decoder_outputs] + decoder_states)

decoder_model.summary()

Visualize this model to check it

plot_model(decoder_model)

To check the network shapes as well

plot_model(decoder_model, show_shapes=True)

Next, use the decoder to generate a sequence (this is the text generation part)

def decode_sequence(input_seq):
    # encode the input into state vectors with the encoder model
    state_value = encoder_model.predict(input_seq)

    # generate an empty target sequence of length 1
    target_seq = np.zeros((1, 1))

    # set the first token of the target sequence to the start token (<sos>)
    target_seq[0, 0] = word2idx_kor['<sos>']

    # break out of the loop once the decoder generates the <eos> token
    eos = word2idx_kor['<eos>']

    # generate the translation
    output_sentence = []
    for _ in range(max_len_kor):
        output_tokens, h, c = decoder_model.predict([target_seq] + state_value)

        # pick the most probable word with argmax (greedy selection)
        idx = np.argmax(output_tokens[0, 0, :])
        if eos == idx:  # <eos> token: stop
            break
        if idx > 0:  # idx 0 is zero padding, so skip it and move on to the next word
            word = tokenizer_kor.index_word[idx]
            output_sentence.append(word)

        # use the generated word as the next decoder input
        target_seq[0, 0] = idx

        # update the states
        state_value = [h, c]

    # tokens are whitespace-separated words, so join them with spaces
    return ' '.join(output_sentence)

# check that it works
for _ in range(5):
    i = np.random.choice(len(eng_texts))
    input_seq = encoder_inputs[i:i+1]
    translation = decode_sequence(input_seq)
    print('-')
    print('Input : ', eng_texts[i])
    print('Translation : ', translation)

Trying it on an ad-hoc test sentence

txt = "Your lips are red."
input_sequence = tokenizer_eng.texts_to_sequences([txt])
encoder_input = pad_sequences(input_sequence, maxlen=max_len_eng)

translation = decode_sequence(encoder_input)

print('-')
print('Input : ', txt)
print('Translation : ', translation)

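The dataset had about 3,000 pairs when I first wrote this code and seems to have grown to about 5,000 by now, so performance will probably be a bit better.

Even so, I could confirm that short sentences are translated reasonably well.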