Programmings

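Course Promotion

I have created a course for job seekers.
If you enrolled in the course through this blog, please send me the title and link of the post that brought you there via an Inflearn message.
- I will send you a Starbucks iced Americano as a thank-you gift.
[Non-majors welcome] Python Data Analysis That Even Complete Beginners Can Pick Up Easily - A Kaggle Primer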

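Notice

This tutorial was prepared as course material for students taking my government-funded training class, and it draws on the textbook Hands-On Machine Learning, 2nd Edition.
The material is condensed according to my own judgment, so if you want to study the topic in depth, I strongly recommend buying the textbook.

Overview

This post walks through building a simple Hexo blog.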

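I. Installing the Required Software

Step 1: Download Node.js from nodejs.org
- Once the installation finishes, verify it with a quick version check.

$ node -v

Step 2: Download Git from git-scm.com
- Once the installation finishes, verify it with a quick version check.

$ git --version

Step 3: Install Hexo
- Hexo can be installed through npm.

$ npm install -g hexo-cli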

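II. GitHub Setup

Create two GitHub repositories.
- One for version control of the post sources (name: myblog)
- One for deploying the published site (name: rain0430.github.io)
- Replace rain0430 with your own username.
Then clone the myblog repo to a suitable local path with git clone.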

$ git clone your_git_repo_address.git

III. Creating the Blog

(Optional) Choose a suitable location and create a working folder as follows.

$ mkdir makeBlog # In PowerShell, md can be used instead of mkdir.
$ cd makeBlog

Create the blog folder under a name of your choice.

$ hexo init myblog
$ cd myblog
$ npm install
$ npm install hexo-server --save
$ npm install hexo-deployer-git --save

+ ERROR Deployer not found: git
+ If hexo-deployer-git is not installed, the error above occurs when you deploy.

Configuring _config.yml
- Edit the site information

title: Your blog title
subtitle: Your blog subtitle
description: A short description of your blog
author: YourName

- Set the blog URL information

url: https://rain0430.github.io
root: /
permalink: :year/:month/:day/:title/
permalink_defaults:

- Connect to GitHub

# Deployment
deploy:
  type: git
  repo: https://github.com/rain0430/rain0430.github.io.git
  branch: main

IV. Deploying to GitHub

Before deploying, start the local server from the terminal and open localhost:4000 in a browser to confirm that the site renders.

$ hexo generate
$ hexo server
INFO  Start processing
INFO  Hexo is running at http://localhost:4000 . Press Ctrl+C to stop.

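Once the page displays correctly, deploy the site to GitHub. With hexo-deployer-git installed and the deploy section of _config.yml filled in, the hexo deploy command handles the upload.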

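Notice

I put these lecture files together so that students in my class can keep practicing on their own. I hope they are always helpful. Please be sure to check the textbooks and references I drew on, and consider buying the textbook or reading the related references.

Preliminary Setup

Load the NanumGothic font so that Korean text renders correctly in plots.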

!pip install psankey #  sankey diagram
%config InlineBackend.figure_format = 'retina'
!apt -qq -y install fonts-nanum

Requirement already satisfied: psankey in /usr/local/lib/python3.6/dist-packages (1.0.1)
fonts-nanum is already the newest version (20170925-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 33 not upgraded.

import matplotlib.font_manager as fm
import matplotlib as mpl
import matplotlib.pyplot as plt

fontpath = '/usr/share/fonts/truetype/nanum/NanumBarunGothic.ttf'
font = fm.FontProperties(fname=fontpath, size=9)
plt.rc('font', family='NanumBarunGothic')
mpl.font_manager._rebuild()
mpl.pyplot.rc('font', family='NanumGothic')
fm._rebuild()

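After the code above has finished running, choose Runtime > Restart runtime.
Then run only the cell above again. Note that _rebuild() is a private matplotlib function and is not available in newer matplotlib releases.

I. Connecting to BigQuery

In the previous session we loaded the data downloaded from DACON into BigQuery.
To read the data stored in BigQuery into Google Colab, proceed as follows.

(1) Authenticating the user account

Run the authentication flow from Google Colab. Do not modify the source code below; just follow the prompts. The process is similar to signing in to Gmail.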

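I. Overview

We will build a simple class.
Building on what we have covered so far, we then write a machine learning example that uses a class.

II. What Are Classes and Instances?

A class is ultimately an extension of the idea of a function.
So far we have seen how convenient functions are.
- As a system grows more complex, however, a single function is no longer enough.
Let's look at a simple example.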

result = 0
def add(num):
  global result
  result += num
  return result

print(add(10))
print(add(20))

10
30

(1) Several functions with the same behavior

Now suppose we need two calculators.

result1 = 0
result2 = 0

def add1(num):
  global result1
  result1 += num
  return result1

def add2(num):
  global result2
  result2 += num
  return result2

print(add1(10))
print(add1(20))
print(add2(10))
print(add2(10))

Do we really need two functions to do the same job?
This is where the concept of a class comes in.
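As a rough illustration of where this is heading, the two-calculator problem can be sketched with a single class. The name Calculator and its result attribute are my own choices for this sketch and do not appear in the original example.

```python
class Calculator:
    """Keeps its own running total, so each instance is an independent calculator."""

    def __init__(self):
        self.result = 0  # per-instance state replaces the global variable

    def add(self, num):
        self.result += num
        return self.result


calc1 = Calculator()
calc2 = Calculator()

print(calc1.add(10))  # 10
print(calc1.add(20))  # 30
print(calc2.add(10))  # 10
print(calc2.add(10))  # 20
```

Each instance carries its own total, so a third calculator needs only one more Calculator() call rather than a third function and a third global.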

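(2) The Human class

How might we define a person?
- Basic profile: name, age, hobbies
- Daily life: eating, exercising, sleeping
Now, how many people are in our classroom right now?
- What if there were a common questionnaire that captured everyone's profile and daily routine?
- That questionnaire template is, in effect, a class.

class Human:
  name = "Rain" # field (class attribute)
  age = 30 # field (class attribute)
  
  def exercise(self): # method: an action the object can perform (a function that belongs to the object)
    print("Let's exercise")

Here, self is the first parameter of the method and refers to the instance itself.

(3) Instances

Think of an instance as an individual person who fills in the questionnaire.

rain = Human()
print(rain.name)
print(rain.age)
rain.exercise()

Rain
30
Let's exercise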

III. How to Define a Class

Now we build a class using the function definitions covered last time.

(1) A `Class` example using a basic function

Add the basic __init__ method to the class.
Then create two different people.

class Human:
  def __init__(self, name, age):
    self.name = name
    self.age = age

if __name__ == '__main__':
  evan     = Human("Evan", 9)
  minyoung = Human("SeoYoung", 8)
  print(evan.name)
  print(minyoung.name)

Evan
SeoYoung

(2) Instance methods

This time, let's write instance methods that describe a person's characteristics.

class Human:
  def __init__(self, name, age):
    self.name = name
    self.age = age
  
  def describe(self):
    return f"{self.name} is {self.age} years old."

  def say(self, content):
    return f"{self.name} is talking about {content}."

if __name__ == '__main__':
  evan     = Human("Evan", 9)
  print(evan.describe())

Evan is 9 years old.

(3) Inheritance

Inheritance is the word used when parents pass their assets on to their children.
- Here, the assets are functions, variables, and so on.
We create Asian, European, and African classes.
- The skinColor method, however, is defined once in the parent Human class.

class Human:
  def __init__(self, name, age):
    self.name = name
    self.age = age
  
  def describe(self):
    return f"{self.name} is {self.age} years old."

  def say(self, content):
    return f"{self.name} is talking about {content}."

  def skinColor(self, color):
    return f"{self.name}'s skin color is {color}."

class Asian(Human):
  pass

class European(Human):
  pass

class African(Human):
  pass

if __name__ == '__main__':
  evan     = Asian("Jihoon", 9)
  rose     = European("Rose", 10)
  sam      = African("Sam", 11)

  print(evan.skinColor("Yellow"))
  print(rose.skinColor("white"))
  print(sam.skinColor("black"))

Jihoon's skin color is Yellow.
Rose's skin color is white.
Sam's skin color is black.
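The subclasses above inherit everything unchanged. A child class can also extend what it inherits, which the following sketch tries to show. The Student class, its school attribute, and the school name are hypothetical additions for illustration only; the Human class is repeated in a trimmed form so the block runs on its own.

```python
class Human:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def describe(self):
        return f"{self.name} is {self.age} years old."


class Student(Human):  # hypothetical subclass, not part of the original example
    def __init__(self, name, age, school):
        super().__init__(name, age)  # reuse the parent's initialization
        self.school = school

    def describe(self):
        # Extend the parent's description instead of rewriting it.
        return super().describe() + f" They attend {self.school}."


student = Student("Evan", 9, "Hypothetical Elementary")
description = student.describe()
print(description)
```

Calling super() keeps the shared logic in one place, so a change to Human.describe carries over to every subclass that builds on it.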

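IV. Machine Learning with Class

This chapter shows how to use a class for machine learning.
In particular, it covers how to handle the case where the same data arrives with slightly different column names.

(1) Connecting Colab to Drive

First, upload the weather.csv and weather2.csv data files to Google Drive.
Then connect Google Drive to Google Colab.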

# Mount Google Drive
from google.colab import drive # import drive from google colab

ROOT = "/content/drive"     # default location for the drive
print(ROOT)                 # print content of ROOT (Optional)
drive.mount(ROOT)           # we mount the google drive at /content/drive

/content/drive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Next, change to the directory that contains the data files.

# import join used to join ROOT path and MY_GOOGLE_DRIVE_PATH
from os.path import join  

# path to your project on Google Drive
MY_GOOGLE_DRIVE_PATH = 'My Drive/Colab Notebooks/your/folder'
PROJECT_PATH = join(ROOT, MY_GOOGLE_DRIVE_PATH)

%cd "{PROJECT_PATH}"
!ls

/content/drive/My Drive/Colab Notebooks/your/folder
weather2.csv  weather.csv

(2) Inspecting the two datasets

Check how the column names differ between the two datasets.

import pandas as pd

data = pd.read_csv("weather.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Formatted Date            96453 non-null  object 
 1   Summary                   96453 non-null  object 
 2   Precip Type               95936 non-null  object 
 3   Temperature (C)           96453 non-null  float64
 4   Apparent Temperature (C)  96453 non-null  float64
 5   Humidity                  96453 non-null  float64
 6   Wind Speed (km/h)         96453 non-null  float64
 7   Wind Bearing (degrees)    96453 non-null  float64
 8   Visibility (km)           96453 non-null  float64
 9   Loud Cover                96453 non-null  float64
 10  Pressure (millibars)      96453 non-null  float64
 11  Daily Summary             96453 non-null  object 
dtypes: float64(8), object(4)
memory usage: 8.8+ MB

data_columns = data.columns.tolist()
data_columns

['Formatted Date',
 'Summary',
 'Precip Type',
 'Temperature (C)',
 'Apparent Temperature (C)',
 'Humidity',
 'Wind Speed (km/h)',
 'Wind Bearing (degrees)',
 'Visibility (km)',
 'Loud Cover',
 'Pressure (millibars)',
 'Daily Summary']

data2 = pd.read_csv("weather2.csv")
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Formatted.Date            10000 non-null  object 
 1   Summary                   10000 non-null  object 
 2   Precip.Type               10000 non-null  object 
 3   Temperature..C.           10000 non-null  float64
 4   Apparent.Temperature..C.  10000 non-null  float64
 5   Humidity                  10000 non-null  float64
 6   Wind.Speed..km.h.         10000 non-null  float64
 7   Wind.Bearing..degrees.    10000 non-null  int64  
 8   Visibility..km.           10000 non-null  float64
 9   Loud.Cover                10000 non-null  int64  
 10  Pressure..millibars.      10000 non-null  float64
 11  Daily.Summary             10000 non-null  object 
dtypes: float64(6), int64(2), object(4)
memory usage: 937.6+ KB

(3) Writing the class

We now write the machine learning class. The code is adapted from an article on Medium.
- Reference: Using Classes for Machine Learning
- Weather data

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd


class Model:
    def __init__(self, datafile, model_type=None):
        self.datafile = datafile
        self.df = pd.read_csv(datafile)
        # Overwrite the column names by position so both files share one naming scheme.
        data_columns = ['Formatted Date', 'Summary', 'Precip Type',
                        'Temperature (C)', 'Apparent Temperature (C)',
                        'Humidity', 'Wind Speed (km/h)', 'Wind Bearing (degrees)',
                        'Visibility (km)', 'Loud Cover', 'Pressure (millibars)', 'Daily Summary']
        self.df.columns = data_columns

        # 'rf' selects a random forest; anything else falls back to linear regression.
        if model_type == 'rf':
            self.user_defined_model = RandomForestRegressor()
        else:
            self.user_defined_model = LinearRegression()

    def split(self, test_size):
        X = np.array(self.df[['Humidity', 'Pressure (millibars)']])
        y = np.array(self.df['Temperature (C)'])
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X, y, test_size=test_size, random_state=42)

    def fit(self):
        self.model = self.user_defined_model.fit(self.X_train, self.y_train)

    def predict(self, input_value):
        # Passing None predicts on the held-out test set; otherwise predict a single sample.
        if input_value is None:
            result = self.user_defined_model.predict(self.X_test)
        else:
            result = self.user_defined_model.predict(np.array([input_value]))
        return result

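The class above consists of four methods.
- __init__: loads the data and applies simple column-name handling.
- split: splits the dataset into training and test sets.
- fit: fits the model.
- predict: runs the prediction and returns the result.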
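The column-name handling in __init__ relies on both files listing their columns in the same order. The sketch below isolates that idea and adds a length check so a mismatched file fails loudly. The helper name normalize_columns, the three-column subset, and the toy values are assumptions made for illustration; they are not part of the Medium code.

```python
import pandas as pd

# Assumed canonical names for this sketch (a subset of the real twelve columns).
CANONICAL_COLUMNS = ["Humidity", "Pressure (millibars)", "Temperature (C)"]


def normalize_columns(df, canonical=CANONICAL_COLUMNS):
    """Rename columns by position, refusing to guess when the counts differ."""
    if len(df.columns) != len(canonical):
        raise ValueError(f"expected {len(canonical)} columns, got {len(df.columns)}")
    renamed = df.copy()  # leave the caller's frame untouched
    renamed.columns = canonical
    return renamed


# Tiny made-up frames standing in for weather.csv and weather2.csv.
frame_a = pd.DataFrame({"Humidity": [0.9], "Pressure (millibars)": [1000.0], "Temperature (C)": [7.0]})
frame_b = pd.DataFrame({"Humidity": [0.8], "Pressure..millibars.": [1010.0], "Temperature..C.": [9.5]})

normalized_a = normalize_columns(frame_a)
normalized_b = normalize_columns(frame_b)
print(list(normalized_b.columns))
```

Positional renaming is only safe when the column order is known to match; if the order could differ, an explicit old-name to new-name mapping would be the more careful choice.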

(4) Checking class reusability

Create a_model and b_model from this class to confirm that its methods can be reused across datasets.

if __name__ == '__main__':
    a_model = Model(datafile="weather.csv", model_type=None)
    a_model.split(0.2)
    a_model.fit()    
    print(a_model.predict([.9, 1000]))
    print("Accuracy: ", a_model.model.score(a_model.X_test, a_model.y_test))

[6.83625473]
Accuracy:  0.39578560465686424

if __name__ == '__main__':
    b_model = Model(datafile="weather2.csv", model_type=None)
    b_model.split(0.3)
    b_model.fit()
    print(b_model.predict([.9, 1000]))
    print("Accuracy: ", b_model.model.score(b_model.X_test, b_model.y_test))

[7.15581866]
Accuracy:  0.35698882811391797

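With a class like this, only the few lines above are needed each time a new weather dataset arrives, which keeps the source code far more concise. Note that score() on a scikit-learn regressor returns the R² coefficient of determination, so the value labeled "Accuracy" above is R² rather than a classification accuracy.

V. Further Study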

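I. Overview

Write your own functions.
Then turn them into something you can execute as a script.

II. Built-in Functions

A function is a piece of code that performs a specific task.
Built-in examples include sum() and len().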

x = [1,2,3,4,5]
print(sum(x))
print(len(x))

III. User-Defined Function Example

Now let's write a user-defined function.
In a function declaration, def is short for define.

def my_avg(x):
  sum_var = sum(x)
  len_var = len(x)
  return sum_var / len_var

print(my_avg(x))

3.0

A basic user-defined function consists of parameters and a return statement.
- The parameters can be of many types, such as str, int, or DataFrame.
- return specifies the output the function hands back once it has run.
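To make the roles of parameters and return a little more concrete, here is a small sketch with a default parameter and more than one return value. The function name describe_numbers and the digits parameter are made up for this example.

```python
def describe_numbers(values, digits=2):
    """Return the total, count, and rounded mean of a list of numbers."""
    total = sum(values)
    count = len(values)
    mean = round(total / count, digits)
    return total, count, mean  # multiple values are returned as a tuple


total, count, mean = describe_numbers([1, 2, 3, 4, 5])
print(total, count, mean)  # 15 5 3.0
```

Because digits has a default value, callers can omit it; passing digits=1 would round the mean to one decimal place instead.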

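IV. Setting Up an Executable Script in Python

Once the user-defined functions are written, use a main() function so the code can be run quickly. Write the code as shown below.
In PyCharm or VSCode, put the following code in main.py.

# -*- coding: utf-8 -*-

def main():
  print("Hello, this is main().")

if __name__ == "__main__":
  main()

Hello, this is main().

Think of the construct above as a convention for running a Python file.
__name__ is the variable that holds the module's name.
When a file is run directly, __name__ is set to "__main__", which marks that file as the program's entry point. Leave the __name__ == "__main__" check in main.py unchanged.
After saving the file, run it from the shell as follows.

~ $ python main.py
Hello, this is main().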
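The difference between running a file and importing it can be observed directly. The sketch below writes a throwaway module to a temporary directory, runs it as a script, then imports it, and records __name__ in both cases. The file name name_demo.py and the variable reported_name are hypothetical.

```python
import importlib.util
import subprocess
import sys
import tempfile
from pathlib import Path

SOURCE = "reported_name = __name__\nprint(reported_name)\n"

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "name_demo.py"  # hypothetical file name
    script.write_text(SOURCE, encoding="utf-8")

    # Case 1: run the file directly, as `python name_demo.py` would.
    completed = subprocess.run(
        [sys.executable, str(script)], capture_output=True, text=True, check=True
    )
    name_when_run = completed.stdout.strip()

    # Case 2: import the same file as a module.
    spec = importlib.util.spec_from_file_location("name_demo", script)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    name_when_imported = module.reported_name

print(name_when_run)       # __main__
print(name_when_imported)  # name_demo
```

This is why code under the if __name__ == "__main__": check runs when the file is executed but stays quiet when another file imports it.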

(1) File structure

We will write two files.
- calculation.py
- main.py
The basic code goes in calculation.py; main.py then imports that module and uses its functions.

(2) Writing and running calculation.py

Write simple arithmetic functions.

# -*- coding: utf-8 -*-
a = 3
b = 4

def plus(a, b): 
  c= a+b
  return c

def subtract(a, b):
  c = a-b
  return c

if __name__ == "__main__":
  print("a + b =", plus(a, b))
  print("a - b =", subtract(a, b))

Save the file as written above.
Then run it from the shell as follows.

~ $ python calculation.py
a + b = 7
a - b = -1

(3) Writing and running main.py

Remove the if block at the bottom of calculation.py and save the file again.
Then write main.py as follows.

# -*- coding: utf-8 -*-
import calculation as cal

a = 3
b = 4

def main():
  print("Hello, this is main().")
  print("a + b =", cal.plus(a, b))
  print("a - b =", cal.subtract(a, b))

if __name__ == "__main__":
  main()

Now run main.py from the shell.

~ $ python main.py
Hello, this is main().
a + b = 7
a - b = -1

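(4) Interim summary

When files sit in the same directory, functions from another file (a module) can be brought in with import, just as when importing from a package.
Each file can hold a variety of functions, all of which can be imported.

V. Setting Up an Executable Script with Two Folders

We have confirmed that functions from another file in the same folder can be imported.
Next, we create two folders and implement a separate piece of functionality in each.
The two folders are:
- arithmetic, for the arithmetic operations
- dataPreprocessing, for data preprocessing
Each folder will contain two files.
Finally, main.py sits on its own at the top level.

(1) The arithmetic folder

Save one function each in plus.py and subtract.py.
- plus.py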

# -*- coding: utf-8 -*-
def add(a, b): 
  c= a+b
  return c

- subtract.py

# -*- coding: utf-8 -*-
def minus(a, b):
  c = a-b
  return c

(2) The dataPreprocessing folder

Write the source code for importData.py, which loads the data, and processing.py, which handles data preprocessing, and save each file.
- importData.py

# -*- coding: utf-8 -*-

def readData():
    print("~~ Loading the data ~~ ")
    data = "Data loaded from BigQuery"
    return data

- processing.py

# -*- coding: utf-8 -*-
from time import sleep

def process_data(data):
    print("~~ Running the data preprocessing function! ~~")
    modified_data = data + " has been modified."
    sleep(3)
    print("~~ Data preprocessing finished! ~~")
    return modified_data

(3) Updating main.py

Next, update main.py as follows.

# -*- coding: utf-8 -*-
from arithmetic import plus as pl
from arithmetic import subtract as sub
from dataPreprocessing import processing
from dataPreprocessing import importData

a = 3
b = 4

def main():
  print("~~ Starting the arithmetic ~~ ")
  print("a + b =", pl.add(a, b))
  print("a - b =", sub.minus(a, b))
  print("~~ Finished the arithmetic ~~ ")

  ## Start data preprocessing
  data = importData.readData()
  processing.process_data(data)

if __name__ == "__main__":
  main()

Then run it from the shell, and you should get the output below.

~ $ python main.py
~~ Starting the arithmetic ~~ 
a + b = 7
a - b = -1
~~ Finished the arithmetic ~~ 
~~ Loading the data ~~ 
~~ Running the data preprocessing function! ~~
~~ Data preprocessing finished! ~~

(4) Reviewing the file structure

The file structure is as follows.

.
├── arithmetic # folder
│   ├── plus.py
│   └── subtract.py
├── dataPreprocessing # folder
│   ├── importData.py
│   └── processing.py
└── main.py
VI. Conclusion

from names the folder (the package).
import names a .py file (a module) inside that folder.
The functions defined in each file can then be imported and used.
When submitting a project, I recommend refactoring it as above so that simply running main.py produces the results.

Overview

In an earlier EDA step, we visualized the data to identify which variables contain missing values.
- Reference: EDA with Housing Price Prediction - Handling Missing Values
This post writes the code that handles those missing values.

I. Connecting Google Drive

Whenever you start Google Colab, the first thing to do is connect Drive.

from google.colab import drive # load the package
from os.path import join  

ROOT = "/content/drive"     # default Drive mount point
print(ROOT)                 # print content of ROOT (Optional)
drive.mount(ROOT)           # mount Drive at the default path

MY_GOOGLE_DRIVE_PATH = 'My Drive/Colab Notebooks/inflearn_kaggle/' # project path inside Drive
PROJECT_PATH = join(ROOT, MY_GOOGLE_DRIVE_PATH) # full project path
print(PROJECT_PATH)

/content/drive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Colab Notebooks/inflearn_kaggle/

%cd "{PROJECT_PATH}"

/content/drive/My Drive/Colab Notebooks/inflearn_kaggle

!ls

data  docs  source

I created subfolders such as data, docs, and source inside the inflearn_kaggle folder.
The downloaded files are therefore in data.

II. Loading the Project Packages

Load the data science packages.

import pandas as pd
import pandas_profiling
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns

from IPython.core.display import display, HTML

For visualization, set up a consistent plotting template.

%matplotlib inline
import matplotlib.pylab as plt

plt.rcParams["figure.figsize"] = (14,4)
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.color'] = 'r'
plt.rcParams['axes.grid'] = True

III. Loading the Data

Write the code that loads train.csv and test.csv and checks them.

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
print("data import is done")

data import is done

Check once more that the data was actually loaded correctly.

train.shape, test.shape

((1460, 81), (1459, 80))

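The preliminary work for data preprocessing is now complete.

IV. Handling Missing Values, Method 1 - Filling with "None" or 0

There are two broad ways to handle missing values.
- Delete the data, or fill it in.
For code that deletes data, see the tutorial below.
- Pandas Data Handling, Part 1
This post covers how to fill in the data.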
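As a minimal sketch of the filling approach named in the section title, the example below fills categorical columns with the string "None" and numeric columns with 0. The toy DataFrame and the choice of which columns count as categorical or numeric are assumptions made for illustration; the column names merely echo the housing data.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names mirror the housing data but the values are made up.
toy = pd.DataFrame({
    "PoolQC": ["Ex", np.nan, np.nan],
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "MasVnrArea": [196.0, np.nan, 0.0],
})

categorical_cols = ["PoolQC", "GarageType"]  # assumed: missing means "feature absent"
numeric_cols = ["MasVnrArea"]                # assumed: missing means zero area

filled = toy.copy()
filled[categorical_cols] = filled[categorical_cols].fillna("None")
filled[numeric_cols] = filled[numeric_cols].fillna(0)

print(filled.isnull().sum().sum())
```

Filling with "None" or 0 is reasonable only when a missing entry genuinely means the feature does not exist, such as a house without a pool; other columns may call for a different strategy.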

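(1) Checking missing values across the whole dataset

Missing values across the whole dataset can be checked as follows.

new_df = train.copy()

# Percentage of missing values per column
new_df_na = (new_df.isnull().sum() / len(new_df)) * 100
# Keep only the columns that have missing values, sorted from most to least missing (top 30)
new_df_na = new_df_na.drop(new_df_na[new_df_na == 0].index).sort_values(ascending=False)[:30]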
new_df_na

PoolQC          99.520548
MiscFeature     96.301370
Alley           93.767123
Fence           80.753425
FireplaceQu     47.260274
LotFrontage     17.739726
GarageYrBlt      5.547945
GarageType       5.547945
GarageFinish     5.547945
GarageQual       5.547945
GarageCond       5.547945
BsmtFinType2     2.602740
BsmtExposure     2.602740
BsmtFinType1     2.534247
BsmtCond         2.534247
BsmtQual         2.534247
MasVnrArea       0.547945
MasVnrType       0.547945
Electrical       0.068493
dtype: float64

Visualize the result above.

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)
sns.barplot(x=new_df_na.index, y=new_df_na)
plt.xlabel('Variables', fontsize=24)
plt.ylabel('Percent of missing values', fontsize=24)
plt.title('Percent missing data by Variable', fontsize=32)
plt.show()

[Figure: bar chart of the percentage of missing values by variable]

Notice

Preliminary Setup

To use pandas_profiling in Google Colab, first install it from master.zip.
- ref. https://github.com/pandas-profiling/pandas-profiling
Once the installation finishes, restart the runtime in Google Colab.

!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Collecting https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
.
.
.
Successfully installed confuse-1.3.0 htmlmin-0.1.12 imagehash-4.1.0 pandas-profiling-2.8.0 phik-0.10.0 tangled-up-in-unicode-0.0.6 tqdm-4.47.0 visions-0.4.4

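I. Overview and Practice of GBM, XGBoost, and LightGBM

A boosting algorithm trains several weak learners sequentially, assigning greater weight to the examples the previous learners predicted incorrectly so that each new learner corrects those errors.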
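The reweighting idea can be sketched in plain Python with an AdaBoost-style loop over one-dimensional threshold rules (decision stumps). This is a toy illustration of the general boosting principle, not how GBM, XGBoost, or LightGBM are implemented; the data, the helper names, and the choice of three rounds are all assumptions of this sketch.

```python
import math

# Toy 1-D data (made up for illustration); labels are +1 or -1.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]


def stump_predict(x, threshold, polarity):
    """Weak learner: predict `polarity` below the threshold and `-polarity` otherwise."""
    return polarity if x < threshold else -polarity


def best_stump(weights):
    """Return (weighted_error, threshold, polarity) of the lowest-error stump."""
    best = None
    for threshold in [value + 0.5 for value in range(-1, 10)]:
        for polarity in (1, -1):
            error = sum(
                w for x, label, w in zip(X, y, weights)
                if stump_predict(x, threshold, polarity) != label
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def ensemble_predict(x, learners):
    """Weighted vote of all weak learners trained so far."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in learners)
    return 1 if score >= 0 else -1


weights = [1 / len(X)] * len(X)  # start with uniform weights
learners = []                    # one (alpha, threshold, polarity) per round

for _ in range(3):
    error, threshold, polarity = best_stump(weights)
    alpha = 0.5 * math.log((1 - error) / error)  # vote strength of this learner
    learners.append((alpha, threshold, polarity))
    # Raise the weight of misclassified points, lower the rest, then renormalize.
    weights = [
        w * math.exp(-alpha * label * stump_predict(x, threshold, polarity))
        for x, label, w in zip(X, y, weights)
    ]
    total = sum(weights)
    weights = [w / total for w in weights]

accuracy = sum(ensemble_predict(x, learners) == label for x, label in zip(X, y)) / len(X)
print(accuracy)
```

On this toy data no single stump classifies more than 7 of the 10 points correctly, while the weighted vote of the three stumps classifies all of them, which is the effect the paragraph above describes.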

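Overview

This post is about managing projects on GitHub.

I. Writing Your Profile

Your profile should either read like a résumé or highlight what makes you unique.
Just as you would dress neatly and present yourself well for an interview, the profile should leave a good impression on the hiring managers who view it.
It is best to list your name, email, phone number, and similar details as fully as you can.
For projects, pin the top three or four repositories you are currently working on under Pinned Repositories.

If you currently contribute to an open source repository, be sure to pin it to your main page.

II. Installing and Connecting GitHub

The contribution graph (the "grass" of green squares) is a visible sign of passion and diligence.
Data scientists, like developers, have to learn something new and grow every day. Until we leave this line of work, we should build the habit of coding a little each day and committing consistently.
But what should you do if you are new to GitHub?

(1) Signing up for GitHub

https://github.com/join

(2) Adding a repository

To add a repository, select [create repository] at the top left.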

I. Getting the Titanic Data from Kaggle

For an example of fetching Kaggle data, see the post Kaggle with Google Colab.
First, install the kaggle package.

!pip install kaggle

Requirement already satisfied: kaggle in /usr/local/lib/python3.6/dist-packages (1.5.6)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.12.0)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.8.1)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from kaggle) (4.41.1)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.6/dist-packages (from kaggle) (4.0.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.6/dist-packages (from kaggle) (2020.6.20)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.6/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (2.9)

Upload the Kaggle API key file so the client can authenticate.

from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json

{'kaggle.json': b'{"username":"j2hoon85","key":"<redacted>"}'}

!mkdir -p ~/.kaggle # create the directory
!mv kaggle.json ~/.kaggle/ # move kaggle.json into it
!chmod 600 ~/.kaggle/kaggle.json # restrict permissions to the owner

!kaggle competitions list

Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.6 / client 1.5.4)
ref                                            deadline             category            reward  teamCount  userHasEntered  
---------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
tpu-getting-started                            2030-06-03 23:59:00  Getting Started      Kudos        125           False  
digit-recognizer                               2030-01-01 00:00:00  Getting Started  Knowledge       2958           False  
titanic                                        2030-01-01 00:00:00  Getting Started  Knowledge      22881            True  
house-prices-advanced-regression-techniques    2030-01-01 00:00:00  Getting Started  Knowledge       4985            True  
connectx                                       2030-01-01 00:00:00  Getting Started  Knowledge        673           False  
nlp-getting-started                            2030-01-01 00:00:00  Getting Started      Kudos       1455            True  
competitive-data-science-predict-future-sales  2020-12-31 23:59:00  Playground           Kudos       7626           False  
halite                                         2020-09-15 23:59:00  Featured              Swag        534           False  
birdsong-recognition                           2020-09-15 23:59:00  Research           $25,000        244           False  
landmark-retrieval-2020                        2020-08-17 23:59:00  Research           $25,000         53           False  
siim-isic-melanoma-classification              2020-08-17 23:59:00  Featured           $30,000       1672           False  
global-wheat-detection                         2020-08-04 23:59:00  Research           $15,000       1353           False  
open-images-object-detection-rvc-2020          2020-07-31 16:00:00  Playground       Knowledge         45           False  
open-images-instance-segmentation-rvc-2020     2020-07-31 16:00:00  Playground       Knowledge          9           False  
hashcode-photo-slideshow                       2020-07-27 23:59:00  Playground       Knowledge         50           False  
prostate-cancer-grade-assessment               2020-07-22 23:59:00  Featured           $25,000        765           False  
alaska2-image-steganalysis                     2020-07-20 23:59:00  Research           $25,000        869           False  
m5-forecasting-accuracy                        2020-06-30 23:59:00  Featured           $50,000       5558            True  
m5-forecasting-uncertainty                     2020-06-30 23:59:00  Featured           $50,000        909           False  
trends-assessment-prediction                   2020-06-29 23:59:00  Research           $25,000       1047           False

Download the data from Kaggle.

!kaggle competitions download -c titanic

Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.6 / client 1.5.4)
gender_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
train.csv: Skipping, found more recently modified local copy (use --force to force download)

!ls

chloevan_key.pem  gender_submission.csv  sample_data  test.csv	train.csv

Now load the data with pandas.

import pandas as pd

titanic_df = pd.read_csv(r'train.csv')
titanic_df.head(3)

print('titanic variable type:', type(titanic_df))

titanic variable type: <class 'pandas.core.frame.DataFrame'>

II. Key Functions for Data Handling

This chapter introduces several key functions for data handling.

(1) value_counts()

value_counts() returns the distinct values in a column together with the number of rows for each.

val_count = titanic_df['Embarked'].value_counts()
print(type(val_count))
print(val_count)

<class 'pandas.core.series.Series'>
S    644
C    168
Q     77
Name: Embarked, dtype: int64

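(2) Dropping part of a DataFrame

drop() removes rows or columns depending on axis.
Its main parameters are labels, inplace, and axis.
- labels: column name(s) or row index label(s)
- inplace: whether to modify the DataFrame in place
- axis: 0 targets rows, 1 targets columns
First, drop a column using axis=1.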

data = titanic_df.copy()
data_drop = data.drop(labels = 'Age', axis=1)
data_drop.head()
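The other two parameters can be sketched the same way. The example below drops rows with axis=0 and then shows what inplace=True changes. The small passengers DataFrame is made up for illustration and only borrows column names from the Titanic data.

```python
import pandas as pd

# Made-up rows; only the column names follow the Titanic data.
passengers = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22, 38, 26],
    "Embarked": ["S", "C", "S"],
})

# axis=0 drops rows by index label and returns a new DataFrame.
without_first_two = passengers.drop(labels=[0, 1], axis=0)

# inplace=True modifies the DataFrame it is called on and returns None.
working_copy = passengers.copy()
returned = working_copy.drop(labels="Age", axis=1, inplace=True)

print(without_first_two.index.tolist())
print(working_copy.columns.tolist())
print(returned)
```

Because inplace=True returns None, assigning its result back to a variable is a common mistake; either use inplace=True without assignment, or omit it and keep the returned DataFrame.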
