텍스트 마이닝 - 뉴스 분류

2020-12-10

텍스트 마이닝, 텍스트 전처리, Python, 데이터 전처리

Page content

공지

해당 포스트는 취업 준비반 대상 강의 교재로 파이썬 머신러닝 완벽가이드를 축약한 내용입니다.
- 매우 좋은 책이니 가급적 구매하시기를 바랍니다.

텍스트 분류 실습 - 뉴스그룹 분류 개요

사이킷런은 fetch_20newsgroups API를 이용해 뉴스그룹의 분류를 수행해 볼 수 있는 예제 데이터 활용 가능함.
희소 행렬에 분류를 효과적으로 처리할 수 있는 알고리즘은 로지스틱 회귀, 선형 서포트 벡터 머신, 나이브 베이즈 등임.

텍스트 정규화

fetch_20newsgroups()는 인터넷에서 데이터를 받은 후, 올리는 것이기 때문에 인터넷 연결 유무를 확인한다.

from sklearn.datasets import fetch_20newsgroups
news_data = fetch_20newsgroups(subset='all', random_state=156)
print(news_data.keys())

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

Target 클래스가 어떻게 구성돼 있는지 확인해 본다.

import pandas as pd
print('target 클래스의 값과 분포도 \n', pd.Series(news_data.target).value_counts().sort_index())
print('target 클래스의 이름들 \n', news_data.target_names)

target 클래스의 값과 분포도 
 0     799
1     973
2     985
3     982
4     963
5     988
6     975
7     990
8     996
9     994
10    999
11    991
12    984
13    990
14    987
15    997
16    910
17    940
18    775
19    628
dtype: int64
target 클래스의 이름들 
 ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Target 클래스의 값은 0부터 19까지 20개로 구성이 되어 있다.
각각의 개별 데이터가 텍스트로 어떻게 구성되어 있는지 확인해 본다.

print(news_data.data[1])

From: jlevine@rd.hydro.on.ca (Jody Levine)
Subject: Re: insect impacts
Organization: Ontario Hydro - Research Division
Lines: 64

I feel childish.

In article <1ppvds$92a@seven-up.East.Sun.COM> egreen@East.Sun.COM writes:
>In article 7290@rd.hydro.on.ca, jlevine@rd.hydro.on.ca (Jody Levine) writes:
>>>>
>>>>how _do_ the helmetless do it?
>>>
>>>Um, the same way people do it on 
>>>horseback
>>
>>not as fast, and they would probably enjoy eating bugs, anyway
>
>Every bit as fast as a dirtbike, in the right terrain.  And we eat
>flies, thank you.

Who mentioned dirtbikes? We're talking highway speeds here. If you go 70mph
on your dirtbike then feel free to contribute.

>>>jeeps
>>
>>you're *supposed* to keep the windscreen up
>
>then why does it go down?

Because it wouldn't be a Jeep if it didn't. A friend of mine just bought one
and it has more warning stickers than those little 4-wheelers (I guess that's
becuase it's a big 4 wheeler). Anyway, it's written in about ten places that
the windshield should remain up at all times, and it looks like they've made
it a pain to put it down anyway, from what he says. To be fair, I do admit
that it would be a similar matter to drive a windscreenless Jeep on the 
highway as for bikers. They may participate in this discussion, but they're
probably few and far between, so I maintain that this topic is of interest
primarily to bikers.

>>>snow skis
>>
>>NO BUGS, and most poeple who go fast wear goggles
>
>So do most helmetless motorcyclists.

Notice how Ed picked on the more insignificant (the lower case part) of the 
two parts of the statement. Besides, around here it is quite rare to see 
bikers wear goggles on the street. It's either full face with shield, or 
open face with either nothing or aviator sunglasses. My experience of 
bicycling with contact lenses and sunglasses says that non-wraparound 
sunglasses do almost nothing to keep the crap out of ones eyes.

>>The question still stands. How do cruiser riders with no or negligible helmets
>>stand being on the highway at 75 mph on buggy, summer evenings?
>
>helmetless != goggleless

Ok, ok, fine, whatever you say, but lets make some attmept to stick to the
point. I've been out on the road where I had to stop every half hour to clean
my shield there were so many bugs (and my jacket would be a blood-splattered
mess) and I'd see guys with shorty helmets, NO GOGGLES, long beards and tight
t-shirts merrily cruising along on bikes with no windscreens. Lets be really
specific this time, so that even Ed understands. Does anbody think that 
splattering bugs with one's face is fun, or are there other reasons to do it?
Image? Laziness? To make a point about freedom of bug splattering?

I've        bike                      like       | Jody Levine  DoD #275 kV
     got a       you can        if you      -PF  | Jody.P.Levine@hydro.on.ca
                         ride it                 | Toronto, Ontario, Canada

뉴스그룹 기사의 내용뿐만 아니라 뉴스그룹 제목, 작성자, 소속, 이메일 등의 다양한 정보를 가지고 있음.
그러나, 불필요한 부분들은 remove 파라미터를 이용하여 제거할 수 있음.
훈련 데이터와 테스트 데이터로 분류하는 코드를 작성해본다.

from sklearn.datasets import fetch_20newsgroups

# subset='train'으로 학습용 데이터만 추출, remove=('headers', 'footers', 'quotes')로 내용만 추출
train_news = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), random_state=156)

X_train = train_news.data
y_train = train_news.target

# subset='test'으로 테스트 데이터만 추출, remove=('headers', 'footers', 'quotes')로 내용만 추출
test_news = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), random_state=156)

X_test = test_news.data
y_test = test_news.target

print('학습 데이터 크기 {0}, 테스트 데이터 크기 {1}'.format(len(train_news.data), len(test_news.data)))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


학습 데이터 크기 11314, 테스트 데이터 크기 7532

피처 벡터화 변환

이제 피처 벡터화를 진행해야 하는데, 이 때에는 CountVectorizer를 이용해 학습 데이터의 텍스트를 피처 벡터화를 진행
테스트 데이터 역시 피처 벡터화 진행
- 이 때에는 테스트 데이터를 변환(transform) 해줘야 하며, 이 때, fit_transform() 사용 하면 안됨

from sklearn.feature_extraction.text import CountVectorizer

# Count Vectorization 피처 벡터화 변환 진행
cnt_vect = CountVectorizer()
cnt_vect.fit(X_train)

X_train_cnt_vect = cnt_vect.transform(X_train)

# 테스트 데이터를 feature 벡터화 변환 수행
X_test_cnn_vect = cnt_vect.transform(X_test)

print("학습 데이터 텍스트의 CountVectorizer Shape:", X_train_cnt_vect.shape)

학습 데이터 텍스트의 CountVectorizer Shape: (11314, 101631)

이렇게 만들어진 학습 데이터를 CountVectorizer로 피처를 추출한 결과 11314개의 문서에서, 단어가 101631개로 만들어진 것을 확인함

머신러닝 모델 학습/예측/평가

이제 로지스틱 회귀를 활용하여 뉴스그룹에 대한 분류를 예측해본다.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Logistic Regresion을 이용해 학습/예측/평가 수행 
lr_clf = LogisticRegression()
lr_clf.fit(X_train_cnt_vect, y_train)
pred = lr_clf.predict(X_test_cnn_vect)

print('CountVectorized Logistic Regression의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test, pred)))

CountVectorized Logistic Regression의 예측 정확도는 0.608


/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)

Count 기반에서 TF-IDF 기반으로 벡터화 변경하여 예측 모델 수행함.

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF 벡터화를 적용하여 학습 데이터 세트와 테스트 데이터 세트 변환. 
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(X_train)

X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

이번에는 LogisticRegression을 이용해 학습/예측/평가 수행.

lr_clf = LogisticRegression()
lr_clf.fit(X_train_tfidf_vect, y_train)
pred = lr_clf.predict(X_test_tfidf_vect)

print("TF-IDF Logistic Regression의 예측 정확도는 {0:.3f}".format(accuracy_score(y_test, pred)))

TF-IDF Logistic Regression의 예측 정확도는 0.674

TF-IDF가 단순 카운트 기반보다 훨씬 높은 예측 정확도 제공.

모형 업그레이드 1단계

모형을 업그레이드 하기 위해서는 최상의 피처 전처리 수행이 필요함

# stop words 필터링 추가 & ngram을 기본 (1, 1)에서 (1, 2)로 변경해 피처 벡터화 적용
tfidf_vect = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_df=300)
tfidf_vect.fit(X_train)

X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

lr_clf = LogisticRegression()
lr_clf.fit(X_train_tfidf_vect, y_train)
pred = lr_clf.predict(X_test_tfidf_vect)

print("TF-IDF Logistic Regression의 예측 정확도는 {0:.3f}".format(accuracy_score(y_test, pred)))

TF-IDF Logistic Regression의 예측 정확도는 0.692

모형 업그레이드 2단계

이번에는 GridSearchCV를 이용하여 로지스틱 회귀의 하이퍼 파라미터 최적화를 수행한다.

from sklearn.model_selection import GridSearchCV
import time 
import datetime

start = time.time()

# 최적 C값 도출 튜닝 수행 / 과적합 방지용
params = {'C' : [0.01, 0.1]} # [0.01, 0.1, 1, 5, 10]
grid_cv_lr = GridSearchCV(lr_clf, param_grid=params, cv=2, scoring="accuracy", verbose=1)
grid_cv_lr.fit(X_train_tfidf_vect, y_train)
print('Logistic Regression best C parameter :', grid_cv_lr.best_params_)
# print('Logistic Regression Best C Parameter :', grid_cv_lr.best_params_)

sec = time.time()-start
times = str(datetime.timedelta(seconds=sec)).split(".")
times = times[0]
print(times)

Fitting 2 folds for each of 2 candidates, totalling 4 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  2.4min finished


Logistic Regression best C parameter : {'C': 0.1}
0:03:32

import time
import datetime
def bench_mark(start):
  sec = time.time() - start
  times = str(datetime.timedelta(seconds=sec)).split(".")
  times = times[0]
  print(times)

최적 C 값으로 학습된 grid_cv로 예측 및 정확도 평가

pred = grid_cv_lr.predict(X_test_tfidf_vect)
print('TF-IDF Vectorized Logistic Regression의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test, pred)))

TF-IDF Vectorized Logistic Regression의 예측 정확도는 0.645

사이킷런 파이프라인 활용한 머신러닝 수행

사이킷런의 Pipeline 클래스를 이용하여 피처 벡터화와 ML 알고리즘 학습/예측을 위한 코드 작성을 한 번에 진행 가능함.
Pipeline을 이용하여 데이터의 전처리와 머신러닝 학습 과정을 통일된 API 기반에서 처리할 수 있어서 보다 더 직관적인 ML 모델 코드를 생성할 수 있음.
또한 대용량 데이터의 피처 벡터화 결과를 별도 데이터로 저장하지 않고 스트림 기반에서 바로 머신러닝 알고리즘의 데이터로 입력할 수 있기 때문에 수행 시간 절약도 가능함.
다음은 텍스트 분류 예제 코드를 Pipeline을 이용해 재 작성한 코드이다.

from sklearn.pipeline import Pipeline

# TfidfVectorizer 객체를 tfidf_vect로, LogisticRegression 객체를 lr_clf로 생성하는 Pipeline 생성
pipeline = Pipeline([
                     ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df = 300)), 
                     ('lr_clf', LogisticRegression(C=10))
])

위 파이프라인을 활용하면 fit(), transform()과 LogisticRegression의 fit(), predict()가 필요 없음

start = time.time()

pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)

print('Pipeline을 통한 Logistic Regression의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test, pred)))

bench_mark(start)

/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)


Pipeline을 통한 Logistic Regression의 예측 정확도는 0.701
0:05:55

지금까지 진행한 것은 단순하게 파이프라인을 활용해 머신러닝을 수행한 것이며, 이제 Pipeline + GridSearchCV를 적용한다.
파라미터를 최적화하려면 너무 많은 튜닝 시간이 소모되기 때문에 주의 하도록 하며, 총 27개의 파라미터 X CV 2 총 54번의 학습을 진행하기 때문에 오래 걸리니 유의니 시간에 유의하도록 한다.

start = time.time()

pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english')),
    ('lr_clf', LogisticRegression())
])

# Pipeline에 기술된 각각의 객체 변수에 언더바(_)2개를 연달아 붙여 GridSearchCV에 사용될 
# 파라미터/하이퍼 파라미터 이름과 값을 설정. . 
params = { 'tfidf_vect__ngram_range': [(1,1), (1,2), (1,3)],
           'tfidf_vect__max_df': [100, 300, 700],
           'lr_clf__C': [1,5,10]
}

# GridSearchCV의 생성자에 Estimator가 아닌 Pipeline 객체 입력
grid_cv_pipe = GridSearchCV(pipeline, param_grid=params, cv=2 , scoring='accuracy', verbose=1)
grid_cv_pipe.fit(X_train, y_train)
print(grid_cv_pipe.best_params_ , grid_cv_pipe.best_score_)

pred = grid_cv_pipe.predict(X_test)
print('Pipeline을 통한 Logistic Regression 의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test ,pred)))

bench_mark(start)

Reference

권철민. (2020). 파이썬 머신러닝 완벽가이드. 경기, 파주: 위키북스