Google Colab Intro

2020-05-30

Data Visualisation, Python, Google Colab

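I. Introduction

Among the many tools emerging in the big-data era, Google Colab can fairly be called revolutionary.
Machine learning and deep learning once required a high-spec computer; with Google Colab, Google has started offering an environment where they can be learned for free.
As a quick start, run the source code linked below to compare the computation speed of the CPU and the GPU.
- TensorFlow with GPU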

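II. Data Transformation Examples

Let's start with a few simple data-wrangling exercises.

(1) Converting a dictionary to a Series

Run the following source code to practice converting a dictionary to a Series.

# Import pandas
import pandas as pd

# Build a dictionary of key:value pairs and store it as temp_dic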
temp_dic = {'evan': 30, 'chloe': 27}
print(temp_dic)

{'evan': 30, 'chloe': 27}

# Convert to a Series and check the output
data = pd.Series(temp_dic)
print(data)

evan     30
chloe    27
dtype: int64

In the output above, the index labels are evan and chloe, taken from the dictionary keys.
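Because the dictionary keys become the index, values can be looked up by label. The following is a minimal sketch that reuses the same `temp_dic`; the threshold of 28 in the filter is an arbitrary value chosen only for illustration.

```python
import pandas as pd

temp_dic = {'evan': 30, 'chloe': 27}
data = pd.Series(temp_dic)

# The dictionary keys become the index labels.
print(list(data.index))   # ['evan', 'chloe']

# Values can be looked up by label, much like a dictionary.
print(data['evan'])       # 30

# A boolean mask keeps only the entries that satisfy a condition
# (28 is an arbitrary threshold used for illustration).
older = data[data > 28]
print(list(older.index))  # ['evan']
```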

(2) Converting a list to a Series

This time we convert a list to a Series. Pay attention to how the index appears in the output.

import pandas as pd
temp_list = ['2020-05-29', 1.11, '가나다', 'ABC', 100, True]
data = pd.Series(temp_list)
print(data)

0    2020-05-29
1          1.11
2           가나다
3           ABC
4           100
5          True
dtype: object

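This time the index is generated automatically, starting from 0. Because the list mixes strings, numbers, and a boolean, the dtype is object.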
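If the default integer index is not what you want, `pd.Series` also accepts an `index` argument. The sketch below reuses the same list; the label names in `labels` are hypothetical and exist only for this illustration.

```python
import pandas as pd

temp_list = ['2020-05-29', 1.11, '가나다', 'ABC', 100, True]

# Without an index argument, pandas assigns 0, 1, 2, ...
data = pd.Series(temp_list)
print(list(data.index))   # [0, 1, 2, 3, 4, 5]

# Hypothetical labels, one per element of the list.
labels = ['date', 'float', 'korean', 'english', 'int', 'bool']
labeled = pd.Series(temp_list, index=labels)

# Elements can now be looked up by label instead of position.
print(labeled['int'])     # 100
```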

III. Data Visualisation Example

Next, let's draw a simple chart.

import numpy as np
import matplotlib.pyplot as plt

N = 5
menMeans = (20, 35, 30, 35, 27)
womenMeans = (25, 32, 34, 20, 25)
menStd = (2, 3, 4, 1, 2)
womenStd = (3, 5, 2, 3, 3)
ind = np.arange(N)    # the x locations for the groups
width = 0.35       # the width of the bars: can also be len(x) sequence

p1 = plt.bar(ind, menMeans, width, yerr=menStd)
p2 = plt.bar(ind, womenMeans, width,
             bottom=menMeans, yerr=womenStd)

plt.ylabel('Scores')
plt.title('Scores by group and gender')
plt.xticks(ind, ('G1', 'G2', 'G3', 'G4', 'G5'))
plt.yticks(np.arange(0, 81, 10))
plt.legend((p1[0], p2[0]), ('Men', 'Women'))

plt.show()

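[Figure: stacked bar chart "Scores by group and gender", showing Men and Women scores for groups G1 to G5]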
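The stacking in the chart comes from the `bottom=menMeans` argument: each Women bar starts where the Men bar ends. As a small worked check (not part of the original notebook), the arithmetic below computes where the stacked bars and the upper error bars end, which suggests why a y-axis running up to 80 is enough.

```python
import numpy as np

menMeans = np.array([20, 35, 30, 35, 27])
womenMeans = np.array([25, 32, 34, 20, 25])
womenStd = np.array([3, 5, 2, 3, 3])

# With bottom=menMeans, the top of each stacked bar is the element-wise sum.
stack_tops = menMeans + womenMeans
print(stack_tops)       # [45 67 64 55 52]

# The Women error bars extend above the stack tops by womenStd.
highest_point = (stack_tops + womenStd).max()
print(highest_point)    # 72
```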

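Note: Korean text is hard to render in plots without separate font setup.
- A video covering this will be produced and released later; in the meantime, readers taking the first lecture should refer to the document below.
- Kakao Arena 3 EDA on Google Colab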

IV. Machine Learning Example

This time we use KNN in Google Colab to build a classification model, which is a form of supervised learning.
First, import the required modules.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

Next, load the data.

iris_dataset = load_iris()
print("Target names: {}".format(iris_dataset['target_names']))
print("Feature names: {}".format(iris_dataset['feature_names']))
print("Type of data: {}".format(type(iris_dataset['data'])))
print("Shape of data: {}".format(iris_dataset['data'].shape))

Target names: ['setosa' 'versicolor' 'virginica']
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Type of data: <class 'numpy.ndarray'>
Shape of data: (150, 4)

Split the data into Train and Test sets.

X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], random_state=0)
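To see what the split produced, it helps to print the shapes. The sketch below repeats the same call; since neither `test_size` nor `train_size` is passed, scikit-learn falls back to its default of holding out 25% of the rows for testing.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# 150 rows in total: 112 go to training and 38 to testing.
print(X_train.shape, X_test.shape)   # (112, 4) (38, 4)
print(y_train.shape, y_test.shape)   # (112,) (38,)
```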

Next, train the model.

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
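With `n_neighbors=1` and the Euclidean metric (`metric='minkowski'`, `p=2`), prediction amounts to finding the single closest training row and copying its label. The following is a rough NumPy sketch of that idea, not scikit-learn's actual implementation; the function `predict_1nn` and the toy arrays are hypothetical and exist only for illustration.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_new):
    """Return, for each row of X_new, the label of its nearest training row."""
    # Pairwise differences, shape (n_new, n_train, n_features).
    diffs = X_new[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Euclidean distance between every new row and every training row.
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the closest training row for each new row.
    nearest = dists.argmin(axis=1)
    return y_train[nearest]

# Hypothetical toy data: two well-separated clusters labelled 0 and 1.
toy_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.5]])
toy_y = np.array([0, 0, 1, 1])
toy_new = np.array([[1.1, 0.9], [4.8, 5.2]])

toy_pred = predict_1nn(toy_X, toy_y, toy_new)
print(toy_pred)   # [0 1]
```

`KNeighborsClassifier` can use faster neighbor-search structures internally, so this sketch is only meant to convey the intuition.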

To test the trained model, create a hypothetical sample.
- sepal length: 4 cm
- sepal width: 2.1 cm
- petal length: 1.2 cm
- petal width: 0.7 cm

X_new = np.array([[4, 2.1, 1.2, 0.7]])
print("X_new.shape: {}".format(X_new.shape))

X_new.shape: (1, 4)

Pass the hypothetical sample to the predict function to see which species it is classified as.

prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(iris_dataset['target_names'][prediction]))

Prediction: [0]
Predicted target name: ['setosa']
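The lookup `iris_dataset['target_names'][prediction]` works because indexing a NumPy array with an array of integers returns the elements at those positions. A small standalone sketch, with the class names copied from the output shown earlier and `many` as a hypothetical batch of predictions:

```python
import numpy as np

target_names = np.array(['setosa', 'versicolor', 'virginica'])

# predict() returns an array of class indices, one per input row.
prediction = np.array([0])
print(target_names[prediction])      # ['setosa']

# Indexing with a single integer gives the plain name instead.
print(target_names[prediction[0]])   # setosa

# The same lookup maps a whole batch of predictions at once.
many = np.array([2, 0, 1])
print(target_names[many])            # ['virginica' 'setosa' 'versicolor']
```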

The sample is classified as setosa.
Next, evaluate the model.

y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
print("Test set score (np.mean): {:.2f}".format(np.mean(y_pred == y_test)))
print("Test set score (knn.score): {:.2f}".format(knn.score(X_test, y_test)))

Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]
Test set score (np.mean): 0.97
Test set score (knn.score): 0.97
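Both scores agree because, for a classifier, `knn.score` reports accuracy, and `np.mean(y_pred == y_test)` computes the same quantity by hand: the comparison yields True/False per sample, and the mean of booleans is the fraction that are True. A tiny sketch with made-up labels (the `_demo` arrays are hypothetical):

```python
import numpy as np

# Hypothetical true and predicted labels for five samples.
y_true_demo = np.array([2, 1, 0, 2, 0])
y_pred_demo = np.array([2, 1, 0, 1, 0])

# Element-wise comparison: True where the prediction is correct.
matches = y_pred_demo == y_true_demo

# True counts as 1 and False as 0, so the mean is the accuracy.
accuracy = np.mean(matches)
print("{:.2f}".format(accuracy))   # 0.80
```

By the same arithmetic, a score of 0.97 on 38 test samples is consistent with 37 correct predictions (37 / 38 ≈ 0.974).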

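V. Conclusion

The most surprising part of everything so far should be that we never installed pandas, numpy, matplotlib, or scikit-learn, the libraries most commonly needed for data analysis.
In other words, the Anaconda installation step that most textbooks require before starting data analysis is unnecessary here.
Don't hesitate: open Google Colab right now. (It's free!)
From here, the course proceeds from Python basics through machine learning and on to getting started with Kaggle.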