
IMDb Movie Review Sentiment Classification with an LSTM

Author: 여행 초짜
Date: 2022.09.02

Before we start: this post may contain mistakes, so I would appreciate it if you point out anything that is wrong.

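In the previous post we looked at the sequence-to-sequence (seq2seq) model and a few examples of the attention mechanism. In this post we train a sentiment classification model that uses an LSTM to classify IMDb movie reviews as positive or negative, and then apply an attention mechanism to it. The implementation uses PyTorch in Python. While training, we look not only at how the loss and accuracy change on the training and validation sets, but also at the attention scores and at sample predictions.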

For an explanation of the seq2seq model and attention, please refer to the previous post. The training code is on GitHub at the link below. This post focuses on the model implementation, so see the GitHub repository for the full data preprocessing and training code.

To tokenize the text, I implemented and used a word tokenizer. Subword tokenizers such as byte-pair encoding (BPE), described in the Word2Vec post, are more common today because they address the unknown-token problem. Here, however, I chose a word-level tokenizer so that the attention scores show which words the model focused on when making a prediction.
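
As a rough illustration of what a word-level tokenizer does, here is a minimal sketch. The class name `WordTokenizer`, the `[PAD]` token, the regular expression, and the `vocab_size` cut-off are assumptions made for this example; only the `[UNK]` token is visible in the sample outputs later in this post. The actual tokenizer is in the GitHub code.

```python
import re
from collections import Counter


class WordTokenizer:
    """Toy word-level tokenizer (hypothetical; not the GitHub implementation)."""

    def __init__(self, texts, vocab_size=10000):
        counts = Counter(w for t in texts for w in self.split(t))
        # ids 0 and 1 are reserved for the special tokens
        most_common = [w for w, _ in counts.most_common(vocab_size - 2)]
        self.id2word = ["[PAD]", "[UNK]"] + most_common
        self.word2id = {w: i for i, w in enumerate(self.id2word)}
        self.pad_token_id = self.word2id["[PAD]"]
        self.unk_token_id = self.word2id["[UNK]"]

    @staticmethod
    def split(text):
        # lowercase, keep letters/digits/apostrophes, split on everything else
        return re.findall(r"[a-z0-9']+", text.lower())

    def encode(self, text):
        # words outside the vocab fall back to the [UNK] id
        return [self.word2id.get(w, self.unk_token_id) for w in self.split(text)]

    def decode(self, ids):
        return " ".join(self.id2word[i] for i in ids)


# Only the three most frequent words fit next to the two special tokens.
tokenizer = WordTokenizer(["this movie was awful", "this movie was great"], vocab_size=5)
ids = tokenizer.encode("This movie was wonderful!")
print(tokenizer.decode(ids))  # this movie was [UNK]
```

Because every id maps back to exactly one whole word, a per-token attention score can be read directly as a per-word score, which is the reason given above for not using subwords.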

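Today's contents:

LSTM sentiment classification model
Attention module
Training the sentiment classification model
Training results of the sentiment classification model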

GitHub code for the LSTM sentiment classification model with attention

“ LSTM Sentiment Classification Model ”

Here we look at the LSTM code for sentiment classification. The code is written in PyTorch and builds a model that takes a tokenized sentence as input and outputs a value between 0 and 1.

import torch
import torch.nn as nn


class SentimentLSTM(nn.Module):
    def __init__(self, config, pad_token_id, device):
        super(SentimentLSTM, self).__init__()
        self.pad_token_id = pad_token_id
        self.device = device
        self.is_attn = config.is_attn
        self.hidden_size = config.hidden_size
        self.vocab_size = config.vocab_size
        self.num_layers = config.num_layers
        self.dropout = config.dropout

        # Token ids -> dense vectors; the pad token's embedding stays at zero.
        self.embedding = nn.Embedding(self.vocab_size, self.hidden_size, padding_idx=self.pad_token_id)
        # Bidirectional LSTM, so every time step outputs hidden_size*2 features.
        self.lstm = nn.LSTM(input_size=self.hidden_size,
                            hidden_size=self.hidden_size,
                            num_layers=self.num_layers,
                            batch_first=True,
                            dropout=self.dropout,
                            bidirectional=True)
        if self.is_attn:
            self.attention = Attention(self.hidden_size*2)
        # Binary classification head: a single probability per review.
        self.fc = nn.Sequential(
            nn.Linear(self.hidden_size*2, 1),
            nn.Sigmoid()
        )
        self.relu = nn.ReLU()

    def init_hidden(self):
        # Zero initial states of shape (num_layers * 2 directions, batch size, hidden size).
        h0 = torch.zeros(self.num_layers*2, self.batch_size, self.hidden_size).to(self.device)
        c0 = torch.zeros(self.num_layers*2, self.batch_size, self.hidden_size).to(self.device)
        return h0, c0

    def forward(self, x):
        # x: (batch size, max length) tensor of token ids
        self.batch_size = x.size(0)
        attn_output = None
        h0, c0 = self.init_hidden()

        x = self.embedding(x)             # (batch size, max length, hidden size)
        x, _ = self.lstm(x, (h0, c0))     # (batch size, max length, hidden size*2)
        if self.is_attn:
            attn_output = self.attention(self.relu(x))   # (batch size, max length)
            x = x * attn_output.unsqueeze(-1)             # weight each time step
        x = torch.sum(x, dim=1)           # sum over time: (batch size, hidden size*2)
        x = self.fc(x)                    # (batch size, 1)

        return x.squeeze(1), attn_output

LSTM
The config in the code above holds the values from the config.json file in the GitHub code, which are used to initialize the model.

pad_token_id: id of the pad token in the vocab.
is_attn: whether to use attention.
hidden_size: hidden dimension of the LSTM model.
vocab_size: size of the tokenized vocab.
num_layers: number of LSTM layers.
dropout: dropout ratio of the LSTM model.
embedding, lstm: declaration of the embedding layer and the LSTM model.
attention: declaration of the Attention module (explained with the code below).
fc: declaration of the fully-connected layer for binary classification.
init_hidden: function that initializes the LSTM hidden state.
forward: function that the training input passes through (size of x: batch size * max length).
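
Since the input x has size batch size * max length, reviews of different lengths have to be brought to a common length before they can be batched. The following is a minimal sketch of that step, assuming right-padding with the pad token id and truncation of long reviews. The helper name `pad_batch` and both of these choices are assumptions for illustration; the actual preprocessing is in the GitHub code.

```python
def pad_batch(sequences, max_len, pad_token_id=0):
    """Truncate or right-pad each list of token ids to exactly max_len."""
    batch = []
    for seq in sequences:
        seq = seq[:max_len]                              # truncate long reviews
        padding = [pad_token_id] * (max_len - len(seq))  # fill up short ones
        batch.append(seq + padding)
    return batch


batch = pad_batch([[5, 8, 2], [7], [4, 4, 9, 3, 6]], max_len=4)
print(batch)  # [[5, 8, 2, 0], [7, 0, 0, 0], [4, 4, 9, 3]]
```

The pad id used here should be the same one passed to the model as pad_token_id, so that the embedding layer maps padded positions to the zero vector.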

“ Attention Module ”

In the LSTM model above we could choose whether to use attention. If attention is enabled, the output of the LSTM is passed into the attention module below.

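import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    def __init__(self, hidden_size):
        super(Attention, self).__init__()
        self.hidden_size = hidden_size

        # Two-layer MLP that maps each time step's features to one scalar score.
        self.attention  = nn.Sequential(
            nn.Linear(self.hidden_size, int(self.hidden_size/2)),
            nn.ReLU(),
            nn.Linear(int(self.hidden_size/2), 1)
        )

    def forward(self, x):
        # x: (batch size, max length, hidden dimension)
        x = self.attention(x)      # (batch size, max length, 1)
        x = x.squeeze(2)           # (batch size, max length)
        x = F.softmax(x, dim=1)    # normalize the scores over the sequence
        return x

Attention
Unlike the LSTM model, this module does not read config itself. It receives the hidden dimension as an argument from SentimentLSTM.

hidden_size: the hidden dimension passed in by the LSTM model (hidden_size*2 there, because the LSTM is bidirectional).
attention: the attention layers.
forward: where the LSTM output enters (size of x: batch size * max length * hidden dimension). It returns one weight per token (batch size * max length), and the weights of each review sum to 1.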
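
To make the role of these weights concrete, the plain-Python sketch below reproduces the last steps for a single review: a softmax over per-token scores, followed by the weighted sum that `SentimentLSTM.forward` computes with `x * attn_output.unsqueeze(-1)` and `torch.sum(x, dim=1)`. The scores and vectors are made-up numbers, and the helper names `softmax` and `attention_pool` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(outputs, scores):
    """Weighted sum of per-token vectors; the weights are softmax(scores)."""
    weights = softmax(scores)
    dim = len(outputs[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, outputs)) for d in range(dim)]
    return pooled, weights


# Three tokens, each with a 2-dimensional LSTM output (made-up values).
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 2.0]  # the third token gets the highest raw score
pooled, weights = attention_pool(outputs, scores)
print([round(w, 3) for w in weights])
print([round(p, 3) for p in pooled])
```

The token with the highest score contributes most to the pooled vector, which is why the weights can later be read as a measure of how much the model focused on each word.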

“ Training the Sentiment Classification Model ”

Now let's look at how training works through the training code. The self. prefixes appear because, as you can see in the GitHub code, the training code lives inside a class and these are its attributes. You can ignore them here.

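import torch
import torch.nn as nn
import torch.optim as optim

# SentimentLSTM expects (config, pad_token_id, device).
# self.pad_token_id is an assumed attribute name; see the GitHub code for the real one.
self.model = SentimentLSTM(self.config, self.pad_token_id, self.device).to(self.device)
self.criterion = nn.BCELoss()
self.optimizer = optim.Adam(self.model.parameters(), lr=self.lr)

for epoch in range(self.epochs):
    print(epoch+1, '/', self.epochs)
    print('-'*10)

    for phase in ['train', 'val']:
        if phase == 'train':
            self.model.train()
        else:
            self.model.eval()

        for i, (x, y) in enumerate(self.dataloaders[phase]):
            batch = x.size(0)
            x, y = x.to(self.device), y.to(self.device)
            self.optimizer.zero_grad()

            # Gradients are tracked only in the train phase.
            with torch.set_grad_enabled(phase=='train'):
                # forward returns (probabilities, attention scores)
                output, _ = self.model(x)
                loss = self.criterion(output, y)

                if phase == 'train':
                    loss.backward()
                    self.optimizer.step()

Declaring what is needed for training
This part creates the model defined above and declares the loss function and optimizer needed for training.

self.model, self.criterion, self.optimizer: declaration of the sentiment classification model, the loss function (binary cross-entropy), and the Adam optimizer.

Training the sentiment classification model
Next is the training part, which corresponds to the for epoch loop in the code.

x, y: x is the tokenized IMDb data and y is the positive/negative label (0 or 1) of each review.
with torch.set_grad_enabled(...) block: computes the loss and, in the train phase, updates the model based on that loss.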
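
With its default settings, `nn.BCELoss` averages the binary cross-entropy over the batch, and an accuracy like the one reported in the results can be obtained by thresholding the sigmoid output. The plain-Python sketch below computes both quantities on made-up numbers. The 0.5 threshold, the `eps` guard, and the helper names are assumptions, since the metric code is not shown in this post.

```python
import math


def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over a batch of probabilities and 0/1 labels."""
    losses = [
        # eps guards against log(0); an assumption of this sketch
        -(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps)))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)


def accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    correct = sum(int(p >= threshold) == int(y) for p, y in zip(probs, labels))
    return correct / len(labels)


probs = [0.9, 0.2, 0.6, 0.4]    # made-up model outputs
labels = [1.0, 0.0, 0.0, 1.0]   # made-up ground truth
print(round(bce_loss(probs, labels), 3))
print(accuracy(probs, labels))  # 0.5
```

The first two predictions are confident and correct, so they add little to the loss, while the two wrong predictions account for most of it.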

“ Training Results of the Sentiment Classification Model ”

Results of the sentiment classification model without attention
First, here are the training loss and accuracy histories of the model without attention. The maximum accuracy is 0.883560.

Training loss history

Training accuracy history

Below are a few sample predictions.

this movie was awful plain and simple the animation scenes had absolutely terrible graphics it was very clear to see that this film had about the budget of my [UNK] bill the acting was just as bad i've seen better acting in pornographic films i would seriously like the hour and twenty minutes of my life back in fact i [UNK] on imdb just so that other people don't get sucked into watching this like i did don't get me wrong though i love scifi films this one seemed more like the intro to a video game i'm glad i only spent a dollar to see this one the story line reminded me of the movie pitch black a prisoner on a ship in outer space escapes oh my goodness what are we gonna do i would not even let this play in the background of my house while i was cleaning bottom line here you can do better
******************************************
It is negative with a probability of 0.999
ground truth: 0.0
******************************************

the beloved rogue is a wonderful period piece it portrays [UNK] century paris in grand hollywood fashion yet offering a [UNK] side to existence there as it would be experienced by the poor and the snow it's constantly [UNK] about adding to the [UNK] of the setting brilliant the setting is enhanced by the odd cast of characters including [UNK] [UNK] and [UNK] a brilliant performance is turned in by john barrymore [UNK] only by the magnificent conrad [UNK] who portrays a [UNK] [UNK] louis [UNK] to perfection and yes [UNK] picks his nose on purpose pushing his portrayal to wonderfully [UNK] limits
******************************************
It is positive with a probability of 0.816
ground truth: 1.0
******************************************
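
The printed lines above can be derived from the model's sigmoid output roughly as follows. This is a sketch under two assumptions: outputs of 0.5 or more are read as positive, and the reported probability is that of the predicted class. The function name `describe_prediction` is hypothetical; the actual printing code is in the GitHub repository.

```python
def describe_prediction(prob_positive, threshold=0.5):
    """Turn a sigmoid output into a message like the samples above."""
    if prob_positive >= threshold:
        label, confidence = "positive", prob_positive
    else:
        # for a negative prediction, report the probability of the negative class
        label, confidence = "negative", 1.0 - prob_positive
    return f"It is {label} with a probability of {confidence:.3f}"


print(describe_prediction(0.001))  # It is negative with a probability of 0.999
print(describe_prediction(0.816))  # It is positive with a probability of 0.816
```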

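Results of the sentiment classification model with attention
Next, here are the training loss and accuracy histories of the model with attention. The maximum accuracy is 0.884720, slightly higher than the result without attention.

Training loss history

Training accuracy history

Below are a few sample predictions.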

i saw [UNK] on broadway and liked it a great deal i don't know what happened with the film version because it was dreadful perhaps some dialogue that works on stage just sounds incoherent on screen anyway i couldn't wait for this film to be over the acting is universally over the top only kevin spacey has it together and he seems like he knows he's in a bad movie and can't wait to get out
******************************************
It is negative with a probability of 0.784
ground truth: 0.0
******************************************

how do these guys keep going they're about 50 years old each and act as if they're only 30 they play 3 hours of music at every concert and barely break a sweat this dvd is their first concert in [UNK] brazil although the people don't speak english they try to [UNK] the words to the most famous rush songs and try to sing a foreign language at the concert with their best friends from tom [UNK] to the spirit of radio this concert dvd will keep you in the chair not wanting to pause or move away from the classics that you've listened to when you were young this is their [UNK] reunion tour started in 1974 i went to their [UNK] [UNK] concert and this was just as good although in [UNK] they didn't play [UNK] so i was upset they have [UNK] they have the trees they have [UNK] the pass driven [UNK] red [UNK] a [UNK] roll the bones [UNK] and much more 10 out of 10 because nothing else [UNK] if you never go to a rush concert then at least buy this dvd
******************************************
It is positive with a probability of 0.961
ground truth: 1.0
******************************************

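Since this model uses attention, let's also look at how the attention scores behave. The reviews shown above are too long, so we look at the results for shorter reviews instead. For the positive review, the model focused more on the words great and special.

Per-word attention scores for a positive review

For the negative review, the model focused on the words story and mess.

Per-word attention scores for a negative review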
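
Reading off the most attended words from such a plot amounts to sorting the tokens by their attention score. Here is a minimal sketch; the tokens and scores are made-up values for illustration (real scores are the second return value of the model's forward), and the helper name `top_attention_words` is hypothetical.

```python
def top_attention_words(tokens, scores, k=2):
    """Return the k (token, score) pairs with the highest attention score."""
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up tokens and scores; they do not come from the trained model.
tokens = ["a", "great", "and", "special", "film"]
scores = [0.05, 0.40, 0.05, 0.35, 0.15]
print(top_attention_words(tokens, scores))  # [('great', 0.4), ('special', 0.35)]
```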

That concludes the walkthrough of the code for IMDb sentiment classification with an LSTM. The full training code is on GitHub for reference. Next time we will look at the code for seq2seq machine translation based on a deep GRU.

Tags: #LSTM #SentimentClassification #IMDb
