Text Mining - Bag of Words

2020-11-22

Text mining, text preprocessing, Python, data preprocessing


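Notice

This post is lecture material for a job-preparation class and condenses the book 파이썬 머신러닝 완벽가이드 (Python Machine Learning Complete Guide).
- It is a very good book, so please buy it if you can.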

I. Overview

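Bag of Words (BOW) is a model that ignores context and word order, treats every word in a document uniformly, and extracts feature values by assigning each word a frequency value.
Suppose we have the following three sentences.
- Doc 1: I love dogs.
- Doc 2: I hate dogs and knitting.
- Doc 3: Knitting is my hobby and passion.
Each sentence can be represented as one row of a matrix whose columns are the words, with each cell holding the count of that word in that sentence.
Because a BOW model is easy and fast to build, it is widely used, but the following limitations hold it back in NLP work.
- Lack of contextual meaning
- Sparse matrix problem: most cells of the matrix are 0 (a word that does not occur in a document has count 0), and the number of zeros keeps growing as more documents are added. Storage formats such as COO (Coordinate) or CSR (Compressed Sparse Row) are used to handle this.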
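The three example sentences can be turned into such a matrix by hand. The following is a minimal sketch using only the standard library; the regex tokenizer and the variable names are assumptions made for illustration, and real projects would normally rely on scipy.sparse or scikit-learn rather than hand-built lists. It builds the dense document-term matrix and then the COO and CSR views of the same data.

```python
import re

docs = [
    "I love dogs.",
    "I hate dogs and knitting.",
    "Knitting is my hobby and passion.",
]

# Lowercase and keep alphabetic tokens only (a simplification of real tokenization).
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# Vocabulary: every distinct word, sorted, mapped to a column index.
vocab = sorted({w for doc in tokenized for w in doc})
col = {w: j for j, w in enumerate(vocab)}

# Dense document-term matrix: one row per document, one column per word.
dense = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(tokenized):
    for w in doc:
        dense[i][col[w]] += 1

# COO: store only the non-zero cells as (row, column, value) triples.
coo = [(i, j, v) for i, row in enumerate(dense) for j, v in enumerate(row) if v]

# CSR: non-zero values and their column indices in row order,
# plus indptr, where row i occupies data[indptr[i]:indptr[i + 1]].
data = [v for _, _, v in coo]
indices = [j for _, j, _ in coo]
indptr = [0]
for row in dense:
    indptr.append(indptr[-1] + sum(1 for v in row if v))

print(vocab)
for row in dense:
    print(row)
print(len(coo), "non-zero cells out of", len(docs) * len(vocab))
print("indptr:", indptr)
```

Even in this tiny example fewer than half of the cells are non-zero, and the share of zeros grows quickly as the vocabulary grows, which is why the sparse formats store only the non-zero entries.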

II. BOW Feature Vectorization

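Put simply, feature vectorization converts the text in documents into a data set of word counts or normalized frequency values that a model can use.
If there are M documents and N distinct words, the resulting matrix has shape M x N (number of documents x number of words).
There are two common ways to vectorize BOW features.
- Count-based vectorization
- TF-IDF (Term Frequency - Inverse Document Frequency) based vectorization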

(1) Count-based vectorization

This is the case where a value is assigned to each word feature. A simple example follows.

from collections import Counter
import nltk
from nltk import word_tokenize
nltk.download('punkt')

# Text
text = """Yesterday I went fishing. I don't fish that often,
so I didn't catch any fish. I was told I'd enjoy myself,
but it didn't really seem that fun."""

# Tokenize
tokens = word_tokenize(text)

# Lowercase every token
lower_tokens = [t.lower() for t in tokens]

# Count token occurrences with Counter
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[('i', 5), ('.', 3), ("n't", 3), ('fish', 2), ('that', 2), (',', 2), ('did', 2), ('yesterday', 1), ('went', 1), ('fishing', 1)]

Count vectorization means assigning to each word feature the number of times that word appears in each document, i.e. its count.
Based on this idea, sklearn, a Python machine learning package, provides a dedicated CountVectorizer class.

from sklearn.feature_extraction.text import CountVectorizer

text = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
]
vect = CountVectorizer()
vect.fit(text)
vect.vocabulary_

{'and': 0,
 'document': 1,
 'first': 2,
 'is': 3,
 'one': 4,
 'second': 5,
 'the': 6,
 'third': 7,
 'this': 8}

X = vect.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vect.get_feature_names_out().tolist())
print(X.toarray())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]]

The CountVectorizer class also performs text preprocessing such as lowercasing, tokenization, and stop word filtering.
Its input parameters include max_df, min_df, max_features, stop_words, ngram_range, analyzer, token_pattern, and tokenizer.
- See the official documentation for details.
- url: sklearn.feature_extraction.text.CountVectorizer
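To make two of these parameters concrete, here is a small sketch that reuses the three sentences from the earlier example. The custom stop word list is an assumption chosen for illustration (it is not scikit-learn's built-in English list), and ngram_range=(1, 2) asks for both single words and two-word sequences.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
]

# stop_words: tokens to drop (a hand-picked list, assumed for this example).
# ngram_range=(1, 2): keep unigrams and bigrams built from the remaining tokens.
vect = CountVectorizer(stop_words=['this', 'is', 'the', 'and'], ngram_range=(1, 2))
X = vect.fit_transform(text)

# vocabulary_ maps each feature (word or bigram) to its column index.
for term, index in sorted(vect.vocabulary_.items(), key=lambda kv: kv[1]):
    print(index, term)
print(X.toarray())
```

Stop words are removed before the n-grams are formed, so bigrams such as "second document" appear while "the second" does not.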

(2) TF-IDF based vectorization

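Count-based vectorization has a weakness: it also assigns high values to words that are frequent simply because of how the language works, rather than because they characterize a document. TF-IDF compensates for this.
That is, it gives a high weight to words that appear often within an individual document, but penalizes words that appear often across all documents (e.g., many, frequently, of course), which balances the weights assigned to words.
For document classification, TF-IDF is usually a better choice than count-based vectorization and tends to give better predictive performance.
- For the TF-IDF formula, refer to the official gensim documentation.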
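The weighting idea can also be worked through by hand. The sketch below assumes the weighting that gensim's TfidfModel uses by default (raw term count multiplied by log2(N / df), followed by L2 normalization of each document vector); the toy corpus and the tfidf helper are hypothetical and only meant to illustrate the arithmetic.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, already tokenized and lowercased).
docs = [
    ["i", "love", "dogs"],
    ["i", "hate", "dogs", "and", "knitting"],
    ["knitting", "is", "my", "hobby", "and", "passion"],
]

n_docs = len(docs)
# Document frequency: the number of documents each word appears in.
df = Counter(w for doc in docs for w in set(doc))

def tfidf(doc):
    """Return {word: weight}, where weight = tf * log2(N / df), L2-normalized."""
    tf = Counter(doc)
    raw = {w: tf[w] * math.log2(n_docs / df[w]) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0:  # every word occurs in every document
        return {}
    return {w: v / norm for w, v in raw.items() if v > 0}

weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(word, round(weight, 3))
```

In the first document, "love" receives the largest weight because it occurs in no other document, while "i" and "dogs" are shared with the second document and are weighted lower. A word that occurs in every document would get log2(1) = 0 and drop out entirely.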
Here we will use gensim.
We bring in new data and start with text preprocessing.

from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize
my_documents = ['When the screenwriter John Ridley pitched “The Other History of the DC Universe,” a five-part comic book series that looks at pivotal events through the perspectives of several nonwhite DC heroes, he knew Black Lightning would be at its center. Ridley was 11 when he met the hero in 1977.',
                'This was a Black man who ostensibly looked like me with his own series,” Mr. Ridley said during a recent interview. An added bonus',
                'Mr. Ridley’s career includes writing for television and film, which earned him an Academy Award in 2014 for Best Adapted Screenplay for “12 Years a Slave.” But he is no stranger to comics. He wrote “The American Way,” which was published in 2006, about a group of heroes in the 1960s and their reaction to a Black member joining the team, and a sequel in 2017.'
                ]
punctuations = set('?:!.,;“')
tokens = []
for sentence in my_documents:
  # Build a new list instead of removing items while iterating, which can skip tokens
  sentence_words = [word for word in word_tokenize(sentence) if word not in punctuations]
  tokens.append(sentence_words)

tokens[0][:10]

['When',
 'the',
 'screenwriter',
 'John',
 'Ridley',
 'pitched',
 'The',
 'Other',
 'History',
 'of']

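Source of the text: https://www.nytimes.com/2020/11/21/us/john-ridley-comic-book.html
Next, we integer-encode each word and record how often each word occurs in each document.
First, look up a key in the dictionary (19 here) to check that it maps to a word from the text above.
Each word is represented as a (word_id, word_frequency) pair.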

from gensim.corpora.dictionary import Dictionary

dictionary = Dictionary(tokens)
print(dictionary[19]) # look up a token by its integer id
print(len(dictionary)) # number of distinct tokens learned
Ridley_id = dictionary.token2id.get("Ridley")
print(dictionary.get(Ridley_id))
corpus = [dictionary.doc2bow(article) for article in tokens]
print(corpus[2][:5])

five-part
105
Ridley
[(2, 1), (8, 1), (9, 1), (12, 4), (20, 1)]
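The (word_id, word_frequency) pairs printed above can be reproduced conceptually without gensim. The following is a minimal sketch with two hypothetical helpers, build_vocab and to_bow; it assigns ids in first-seen order, which is an assumption for illustration and will not match the ids gensim assigns.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Map each distinct token to an integer id in first-seen order (hypothetical helper)."""
    token2id = {}
    for doc in tokenized_docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one document, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

sample = [["i", "love", "dogs"], ["i", "hate", "dogs", "and", "dogs"]]
token2id = build_vocab(sample)
print(token2id)
print(to_bow(sample[1], token2id))
```

Each document becomes a short list of pairs that mentions only the words it actually contains, so the zeros of the full document-term matrix are never stored.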

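Use corpus to build a bag of words with gensim.
defaultdict initializes a dictionary that assigns a default value to keys that do not exist yet. Passing int as the argument makes every missing key start at 0, which is ideal for storing word counts.
itertools.chain.from_iterable() lets you iterate over a series of sequences as if they were one continuous sequence. With it, you can easily iterate over the corpus object (a list of lists).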

from collections import defaultdict
import itertools

doc = corpus[2]
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)


total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

print("\n")
# Sort words by total count, highest first
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)

# Print the five most frequent words across the whole corpus
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

a 4
in 4
and 3
for 3
the 2


a 7
the 6
in 5
Ridley 4
” 4

Now we apply TF-IDF based vectorization.
In the code below, it helps to think of tfidf as a kind of model.
- See the official documentation for details on TfidfModel.

from gensim.models.tfidfmodel import TfidfModel

tfidf = TfidfModel(corpus)
tfidf_weights = tfidf[doc]

# Print the first five (term_id, weight) pairs
print(tfidf_weights[:5])

# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

[(9, 0.04430091034557146), (20, 0.04430091034557146), (22, 0.04430091034557146), (23, 0.17720364138228584), (29, 0.04430091034557146)]
and 0.3601014503954231
for 0.3601014503954231
to 0.24006763359694872
which 0.24006763359694872
in 0.17720364138228584

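Words such as and, for, and to are among the most frequent words in this document, but they are not important words. They still rank highest here because, in this tiny three-document corpus, and and for occur only in this document; words that occur in all three documents, such as a and the, receive a weight of 0 and do not appear in the output at all.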

III. Reference

권철민 (Kwon Chul-min). (2020). 파이썬 머신러닝 완벽가이드 (Python Machine Learning Complete Guide). Paju, Gyeonggi: 위키북스 (Wikibooks)
https://www.tutorialspoint.com/gensim/gensim_creating_tf_idf_matrix.htm