Python

Flask Heroku Pandas Postgres 튜토리얼

강의 홍보

개요

  • Flask 기본적인 작동 원리를 배운다.
  • Postgres와 SQLAlchemy를 활용한다.
  • Heroku에 배포를 진행한다.

사전준비

  • Github에 각 개인에게 맞는 Github Repo를 생성한다.
    • 주의 : 반드시 Unique하게 작성해야 한다.
  • 가상환경 설정을 진행한다.
  • PostgreSQL DB 설정은 다음을 참조한다.
virtualenv venv
  • 주요 라이브러리를 설치한다.
pip install Flask psycopg2 SQLAlchemy pandas gunicorn

배포

  • 프로젝트 시작 전에는 항상 선 배포를 하고 시작한다.
  • app.py를 만들어본다.
# -*- coding: utf-8 -*-

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello World!"
  • 3개의 파일이 필요하다.

Scikit-Learn ML Model with Java

강의 홍보

개요

  • scikit-learn 모델을 JAVA에서 구동 시켜야 한다.
  • 크게 3가지 방법론이 존재한다.(원문 참조 : Moving from Python to Java to deploy your machine learning model to production
    • embed : Java 코드 내에서 직접 Python 코드 구현 방법. Jython을 이용하지만, 문제는 Scikit-Learn은 지원하지 않는다. 따라서 일반적으로 Flask API를 통해서 지원하기도 한다.
    • transpile : Scikit-Learn 모델을 전달하는 방법. sklearn-porterm2cgen를 고려할 수 있다.
    • redevelop : Scikit-Learn 모델을 H20나 Spark의 MLib을 구현한 후 배포하는 방법이다.

새로운 제안

  • 그러나 다각도로 여러 시도 및 라이브러리가 존재함.
  • 오늘 소개할 라이브러니느 sklearn2pmml임.
  • JAVA가 설치가 되어 있어야 함.
    • 로컬 환경은 각자 환경변수를 추가해야 함. 이 부분은 생략함.
  • 구글 코랩에서는 아래와 같이 설치가 가능함.
  • Java 코드 참조 : Using scikit-learn model into Java app
!apt update -q
!apt-get install -q openjdk-11-jdk-headless

%env JAVA_HOME "/usr/lib/jvm/java-11-openjdk-amd64"
Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:2 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:5 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Ign:7 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:8 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:9 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Hit:10 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Get:11 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:12 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:14 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [2,867 kB]
Get:15 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [1,075 kB]
Get:16 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [2,297 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [3,302 kB]
Fetched 9,797 kB in 6s (1,534 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
49 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists...
Building dependency tree...
Reading state information...
openjdk-11-jdk-headless is already the newest version (11.0.15+10-0ubuntu0.18.04.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
env: JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
  • 설치가 끝난 후에는 sklearn2pmml을 설치한다.
!pip install sklearn2pmml
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sklearn2pmml
  Downloading sklearn2pmml-0.84.2.tar.gz (6.3 MB)
     |████████████████████████████████| 6.3 MB 5.1 MB/s 
[?25hRequirement already satisfied: joblib>=0.13.0 in /usr/local/lib/python3.7/dist-packages (from sklearn2pmml) (1.1.0)
Requirement already satisfied: scikit-learn>=0.18.0 in /usr/local/lib/python3.7/dist-packages (from sklearn2pmml) (1.0.2)
Requirement already satisfied: sklearn-pandas>=0.0.10 in /usr/local/lib/python3.7/dist-packages (from sklearn2pmml) (1.8.0)
Requirement already satisfied: numpy>=1.14.6 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.18.0->sklearn2pmml) (1.21.6)
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.18.0->sklearn2pmml) (1.4.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.18.0->sklearn2pmml) (3.1.0)
Requirement already satisfied: pandas>=0.11.0 in /usr/local/lib/python3.7/dist-packages (from sklearn-pandas>=0.0.10->sklearn2pmml) (1.3.5)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.11.0->sklearn-pandas>=0.0.10->sklearn2pmml) (2022.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.11.0->sklearn-pandas>=0.0.10->sklearn2pmml) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas>=0.11.0->sklearn-pandas>=0.0.10->sklearn2pmml) (1.15.0)
Building wheels for collected packages: sklearn2pmml
  Building wheel for sklearn2pmml (setup.py) ... [?25l[?25hdone
  Created wheel for sklearn2pmml: filename=sklearn2pmml-0.84.2-py3-none-any.whl size=6298569 sha256=f6a564303dd11e9ce38b6c7b4ec4e8a4632499c61d3ee69f40070a1f1983ef80
  Stored in directory: /root/.cache/pip/wheels/bb/e4/71/d3c8f75fae8d7f387f82099ec8cdd6b83cf1dccaeb3561c7b6
Successfully built sklearn2pmml
Installing collected packages: sklearn2pmml
Successfully installed sklearn2pmml-0.84.2

샘플 코드 작성

  • 간단한 모형을 구축한다.
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn2pmml import PMMLPipeline, sklearn2pmml
import pandas as pd

# 데이터 불러오기
df = load_diabetes()
X = pd.DataFrame(columns = df.feature_names, data = df.get('data'))
y = pd.DataFrame(columns = ['target'], data = df.get('target'))

# 데이터셋 분리하기
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state=42)

# Pipeline 구축하기
pipeline = PMMLPipeline([ ('regressor', DecisionTreeRegressor()) ])

# 모형 학습 시키기
pipeline.fit(X_train, y_train)
PMMLPipeline(steps=[('regressor', DecisionTreeRegressor())])
  • 이번에는 모형을 테스트 한다.
from sklearn.metrics import mean_absolute_error
y_pred = pipeline.predict(X_test)
print('MAE: ', mean_absolute_error(y_pred, y_test))
MAE:  65.42222222222222
  • 이제 모형을 pmml 파일 형태로 저장한다.
# 모델 내보내기
sklearn2pmml(pipeline, 'model.pmml', with_repr = True)

Scikit-Learn Model into JAVA

  • 해당 모형을 이제 이제 Java에서 가져오도록 한다.
  • Java 코드에서는 pmml4s library를 이용한다.
import org.pmml4s.model.Model;

import java.util.*;

public class Main {

    private final Model model = Model.fromFile(Main.class.getClassLoader().getResource("model.pmml").getFile());

    public Double getRegressionValue(Map<String, Double> values) {
        Object[] valuesMap = Arrays.stream(model.inputNames())
                .map(values::get)
                .toArray();

        Object[] result = model.predict(valuesMap);
        return (Double) result[0];
    }

    public static void main(String[] args) {
        Main main = new Main();
        Map<String, Double> values = Map.of(
                "age", 20d,
                "sex", 1d,
                "bmi", -100d,
                "bp", -200d,
                "s1", 1d,
                "s2", 2d,
                "s3", 3d,
                "s4", 4d,
                "s5", 5d,
                "s6", 6d
        );

        double predicted = main.getRegressionValue(values);
        System.out.println(predicted);
    }
}

MLFlow with Scikit-Learn

개요

  • Scikit-Learn 모델을 만든 후, MLFlow로 모델을 배포한다.
  • 머신러닝 코드에 대한 설명은 생략한다.
  • 가상환경 설정에 관한 내용도 생략한다.

라이브러리 불러오기

  • 기존 코드에서 mlflow 라이브러리만 추가한다.
%matplotlib inline

import numpy as np 
import pandas as pd 
import matplotlib as mpl
import matplotlib.pyplot as plt 
import sklearn
import seaborn as sns
import mlflow 
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, plot_roc_curve, confusion_matrix 

print(f"numpy version {np.__version__}")
print(f"pandas version {pd.__version__}")
print(f"matplotlib version {mpl.__version__}")
print(f"seaborn version {sns.__version__}")
print(f"sklearn version {sklearn.__version__}")
print(f"MLFlow version {mlflow.__version__}")
numpy version 1.23.1
pandas version 1.4.3
matplotlib version 3.5.2
seaborn version 0.11.2
sklearn version 1.1.1
MLFlow version 1.27.0

데이터 불러오기

  • 데이터를 불러오도록 한다.
DATA_PATH = "C:\\Users\\your_id\\Desktop\\mlops_tutorial\\data\\creditcard.csv"

df = pd.read_csv(DATA_PATH)
print(df.head())
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   

        V26       V27       V28  Amount  Class  
0 -0.189115  0.133558 -0.021053  149.62      0  
1  0.125895 -0.008983  0.014724    2.69      0  
2 -0.139097 -0.055353 -0.059752  378.66      0  
3 -0.221929  0.062723  0.061458  123.50      0  
4  0.502292  0.219422  0.215153   69.99      0  

[5 rows x 31 columns]

로지스틱 모형 만들기

  • 기존 코드를 참조하여 모델을 만든다.
  • 데이터셋 분리를 한다.

normal = df[df['Class'] == 0].sample(frac=0.5, random_state=42).reset_index(drop=True)
anomaly = df[df['Class']==1]

normal_train, normal_test = train_test_split(normal, test_size = 0.2, random_state=42)
anomary_train, anomary_test = train_test_split(anomaly, test_size = 0.2, random_state=42)

normal_train, normal_validate = train_test_split(normal_train, test_size = 0.25, random_state=42)
anomary_train, anomary_validate = train_test_split(anomary_train, test_size = 0.25, random_state=42)

normal_train.shape, normal_validate.shape, anomary_train.shape, anomary_validate.shape
((85294, 31), (28432, 31), (294, 31), (99, 31))
  • 최종 학습, 테스트 및 검증 세트를 생성하기 위해 각각의 정상 및 이상 데이터 분할을 연결해야 한다.
X_train = pd.concat((normal_train, anomary_train))
X_test = pd.concat((normal_test, anomary_test))
X_validate = pd.concat((normal_validate, anomary_validate))

y_train = np.array(X_train["Class"])
y_test = np.array(X_test["Class"])
y_validate = np.array(X_validate["Class"])

X_train = X_train.drop("Class", axis = 1)
X_test = X_test.drop("Class", axis = 1)
X_validate = X_validate.drop("Class", axis = 1)

X_train.shape, X_validate.shape, X_test.shape, y_train.shape, y_validate.shape, y_test.shape
((85588, 30), (28531, 30), (28531, 30), (85588,), (28531,), (28531,))
  • 표준화를 진행한다.
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_validate = scaler.transform(X_validate)

MLFlow를 통한 학습 및 평가

  • MLFlow 기능으로 검증에 사용하는 사용자 정의 함수를 만들어 본다.
  • 여기서 핵심은
    • mlflow.log_metric() 함수를 통해 지표를 로깅할 수 있음
    • mlflow.log_artifact() 함수를 통해 그래프를 저장할 수 잇음.
def train(sk_model, X_train, y_train):
    sk_model = sk_model.fit(X_train, y_train)
    train_acc = sk_model.score(X_train, y_train)
    mlflow.log_metric("train_acc", train_acc)
    
    print(f"Train Accuracy: (train_acc:.3%)")
    

def evaluate(sk_model, X_test, y_test):
    eval_acc = sk_model.score(X_test, y_test)
    preds = sk_model.predict(X_test)
    auc_score = roc_auc_score(y_test, preds)
    mlflow.log_metric("eval_acc", eval_acc)
    mlflow.log_metric("auc_score", auc_score)
    
    print(f"Auc Score : {auc_score:.3%}")
    print(f"Eval Score : {eval_acc:.3%}")
    roc_plot = plot_roc_curve(sk_model, X_test, y_test, name="Scikit-Learn ROC Curve")
    plt.savefig("sklearn_roc_plot.png")
    plt.show()
    plt.clf()
    conf_matrix = confusion_matrix(y_test, preds)
    ax = sns.heatmap(conf_matrix, annot=True, fmt='g')
    ax.invert_xaxis()
    ax.invert_yaxis()
    plt.ylabel("Actual")
    plt.xlabel("Predicted")
    plt.title("Confusion Matrix")
    plt.savefig("sklearn_conf_matrix.png")
    mlflow.log_artifact("sklearn_roc_plot.png")
    mlflow.log_artifact("sklearn_conf_matrix.png")

MLFlow 실행 로깅 및 확인

  • 실제 실험 이름을 설정하고, MLFlow 실행 시작. 해당 코드를 모두 실행한다.
# 모델 설정
sk_model = LogisticRegression(random_state=None, max_iter=400, solver='newton-cg')

# 실험 이름 설정
mlflow.set_experiment("sklearn_experiment")

# 해당 이름으로 실행 배치
with mlflow.start_run():
    train(sk_model, X_train, y_train)
    evaluate(sk_model, X_test, y_test)
    
    # 하나의 MLFlow 실행 컨텍스트에서 모든 코드를 묶을 수 있음. 
    # 참조 : https://mlflow.org/docs/latest/models.html#model-customization
    mlflow.sklearn.log_model(sk_model, 'log_reg_model')
    
    # 본질적으로 모델과 지표가 로깅되는 현재 실행을 가져오고 출력함. 
    print("Model run: ", mlflow.active_run().info.run_uuid)
mlflow.end_run()
Train Accuracy: (train_acc:.3%)
Auc Score : 86.355%
Eval Score : 99.888%


C:\Users\your_id\Desktop\mlops_tutorial\venv\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function plot_roc_curve is deprecated; Function :func:`plot_roc_curve` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: :meth:`sklearn.metric.RocCurveDisplay.from_predictions` or :meth:`sklearn.metric.RocCurveDisplay.from_estimator`.
  warnings.warn(msg, category=FutureWarning)

png

주요 핵심 머신러닝 리뷰

강의 홍보

개요

  • 수강생들의 머신러닝을 활용한 웹 개발 프로젝트 전 복습 차원에서 준비함.
  • 주 내용은 주요 참고자료를 기반으로 작성하였으며, 참고자료에 없는 코드는 직접 작성하였음을 밝힘.

가장 인기 있는 모델

  • XGBoost와 LightGBM
  • 그 외, 선형회귀, 로지스틱 회귀, 결정 트리, 앙상블 학습, 랜덤 포레스트, XGBoost, LightGBM

선형 회귀

  • 선형 회귀식을 활용한 모델
  • 회귀 계수와 절편을 찾는 것이 중요
  • 기초통계에서 다루는 선형 회귀와 기본적인 개념에서는 동일하나, 기초통계에서와 예측 모델에서의 쓰임새는 다르다는 것을 기억한다.

데이터 생성

  • 단순 선형 회귀식 $y = 3x + 4$에 근사한 데이터 50개 생성
import numpy as np 
import pandas as pd 

np.random.seed(0) # 시드값 고정
intercept = 4 # 절편
slope = 3 # 기울기

# 변동성 주기 위해 노이즈 생성
noise = np.random.randn(50, 1)

# 50개의 x값 생성
x = 5 * np.random.rand(50, 1) # 0과 5사이의 실숫값 50개 생성
y = slope * x + intercept + noise

# 데이터 프레임 생성
data = pd.DataFrame({'X': x[:, 0], 'Y': y[:, 0]})
print(data.head())
          X          Y
0  0.794848   8.148596
1  0.551876   6.055784
2  3.281648  14.823682
3  0.690915   8.313637
4  0.982912   8.816293
  • 위 데이터를 시각화로 구현한다.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(data['X'], data['Y'])
plt.show()

png

django Web 개발 - IRIS Prediction

개요

  • Python Django와 Sklearn을 활용하여 간단한 iris prediction 웹을 만들어본다.

사전준비

  • 머신러닝 기본 이론 및 원리는 어느정도 알고 있다고 가정한다.
  • Django 앱에 대해 어느정도 알고 있다고 가정한다.

무엇을 배우는가?

  • 머신러닝 모델을 활용하여 배포하는 과정을 배운다.

가상환경 설정

  • 가상환경을 생성한다.
$ virtualenv venv
created virtual environment CPython3.9.1.final.0-64 in 475ms
  creator CPython3Posix(dest=/Users/evan/Desktop/django-iris-tutorial/venv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/evan/Library/Application Support/virtualenv)
    added seed packages: pip==22.1.1, setuptools==62.3.2, wheel==0.37.1
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
  • 만들어진 가상환경에 접속한다.
$ source venv/bin/activate
(venv) $
  • 크게 3개의 라이브러리를 설치한다.
    • jupyterlab : 머신러닝 개발 과정을 진행할 에디터로 활용한다.
    • sklearn : 머신러닝 개발 관련 라이브러리이다.
    • django : django 웹 프레임워크 라이브러리이다.
(venv) $ pip install jupyterlab sklearn django

머신러닝 개발

  • iris 데이터를 불러오고 sklearn을 활용하여 모형 개발을 진행한다.
  • 모형 개발 시, 주요 Feature Engineering 과정은 생략한다.

(1) 모형 개발

  • jupyterlab을 실행한다.
(venv) $ python -m jupyterlab
  • 아래와 같이 코드를 입력한다.
# Module 불러오기
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import pandas as pd

# 데이터셋 불러오기
df = pd.read_csv("data/iris.csv")

# 독립변수와 종속변수 분리
X = df[['sepal_length','sepal_width','petal_length','petal_width']]
y = df['classification']

# 훈련데이터와 종속데이터 분리
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# 모형 학습
model = SVC(gamma='auto')
model.fit(X_train, Y_train)

# 모형 예측
sepal_length = float(1.5)
sepal_width = float(5)
petal_length = float(4)
petal_width = float(3)

result = model.predict([[sepal_length,sepal_width,petal_length,petal_width]])  # input must be 2D array
print(result)
['Iris-virginica']
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/base.py:445: UserWarning: X does not have valid feature names, but SVC was fitted with feature names
  warnings.warn(

(2) 모형 저장

  • pickle을 통해 모형 저장을 할 수 있다.
pd.to_pickle(model, r'models/svc_model.pickle')

Django 시작

  • django 웹사이트 프로젝트를 시작한다.
(venv) $ django-admin startproject iris
(venv) $ cd iris
(venv) $ python manage.py startapp predict

settings.py

  • iris/settings.py를 열고 아래와 같이 수정한다.
.
.
# Application definition

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'predict', 
]
.
.

urls.py

  • iris/urls.py를 열고 아래와 같이 수정한다.
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include('predict.urls', namespace='predict'))
]
  • predict/urls.py를 새로 생성하고 아래와 같이 코드를 추가한다.
from django.urls import path
from . import views

app_name = 'predict'

urlpatterns = [
    path('', views.predict, name='predict'),
]

views.py

  • 이제 predict/views.py에서 predict 함수를 만들어 백엔드 처리를 진행하고, 최종 결괏값을 predict.html로 돌려주는 함수를 구현할 것이다.
from django.shortcuts import render

# Create your views here.
def predict(request):
    return render(request, 'predict.html', {})

predict.html

  • predict/templates 폴더를 만들고, predict.html 파일을 새로 생성한다.
hello
  • predict.html 파일이 잘 열리는지 확인한다.
(venv) iris$ python manage.py runserver 

django Web 개발 1 - 기본 설정

프로젝트 개요

  • Python Django로 To-do List를 만들어본다.

파이썬 설치

가상환경 설치

  • 파이썬 설치가 끝났다면, 해당 깃허브를 다운로드 받는다.
$ git clone https://github.com/dschloe/django_to_do.git
  • django_to_do 폴더에 진입 후 아래와 같이 실행한다.
$ cd django_to_do
$ virtualenv venv
created virtual environment CPython3.9.7.final.0-64 in 7884ms     
  creator CPython3Windows(dest=C:\Users\human\Desktop\django_to_do\venv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=C:\Users\human\AppData\Local\pypa\virtualenv)
    added seed packages: pip==22.0.4, setuptools==62.1.0, wheel==0.37.1
  activators BashActivator,BatchActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
  • 실제 실행 후 venv 폴더가 나타나는지 확인한다.
$ ls
README.md  venv/
  • 가상환경에 접속한다.
$ source venv/Scripts/activate
(venv)

환경 설정 Django

Installing Django

$ pip install django pytz

my_app 폴더 생성

  • 새로운 폴더 my_app 하나 생성 후 해당 경로에 접속한다.
$ mkdir my_app && cd my_app

djanog 개발 시작

  • 이제 django 프로젝트를 시작한다.
  • django 설치 후, todo_app 폴더와 manage.py가 실행되는지 확인한다.
$ django-admin startproject todo_app .
$ ls
manage.py*  todo_app/

초기 화면 실행

  • my_app 폴더에서 서버를 구동시킨다.
    • 접속 URL : http://127.0.0.1:8000/
$ python manage.py runserver

PyCaret Kaggle Notebook (Since April 10, 2022)

개요

  • PyCaret이 최근 업데이트 되면서 Kaggle에서 설치 오류가 뜨기 시작함.
  • 해결책은 몇가지 있으나, 그 중 Downgrade 해서 설치 할 예정

캐글 대회 시작

  • 캐글 노트북 시작을 하면 다음 코드가 나타난다. 다음 Cell부터 진행한다.
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/tabular-playground-series-apr-2022/sample_submission.csv
/kaggle/input/tabular-playground-series-apr-2022/train_labels.csv
/kaggle/input/tabular-playground-series-apr-2022/train.csv
/kaggle/input/tabular-playground-series-apr-2022/test.csv

라이브러리 Downgrade

  • 설치하려고 하는 Library는 PyCaret 2.3.5 버전임 (4월 10일 기준 2.3.10 버전)
  • Scikit-Learn 최신 버전은 1.0대 버전임
  • 이를 낮춰서 진행할 것임
!pip install numpy==1.19.5
!pip install matplotlib==3.4.0
!pip install scikit-learn==0.23.2
!pip install pycaret==2.3.5
Collecting numpy==1.19.5
  Downloading numpy-1.19.5-cp37-cp37m-manylinux2010_x86_64.whl (14.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.8/14.8 MB 33.0 MB/s eta 0:00:00
[?25hInstalling collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.5
    Uninstalling numpy-1.21.5:
      Successfully uninstalled numpy-1.21.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-io 0.21.0 requires tensorflow-io-gcs-filesystem==0.21.0, which is not installed.
beatrix-jupyterlab 3.1.7 requires google-cloud-bigquery-storage, which is not installed.
thinc 8.0.15 requires typing-extensions<4.0.0.0,>=3.7.4.1; python_version < "3.8", but you have typing-extensions 4.1.1 which is incompatible.
tfx-bsl 1.7.0 requires pyarrow<6,>=1, but you have pyarrow 7.0.0 which is incompatible.
tfx-bsl 1.7.0 requires tensorflow!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3,>=1.15.5, but you have tensorflow 2.6.3 which is incompatible.
tensorflow 2.6.3 requires absl-py~=0.10, but you have absl-py 1.0.0 which is incompatible.
tensorflow 2.6.3 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
tensorflow 2.6.3 requires typing-extensions<3.11,>=3.7, but you have typing-extensions 4.1.1 which is incompatible.
tensorflow 2.6.3 requires wrapt~=1.12.1, but you have wrapt 1.14.0 which is incompatible.
tensorflow-transform 1.7.0 requires pyarrow<6,>=1, but you have pyarrow 7.0.0 which is incompatible.
tensorflow-transform 1.7.0 requires tensorflow!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<2.9,>=1.15.5, but you have tensorflow 2.6.3 which is incompatible.
tensorflow-serving-api 2.8.0 requires tensorflow<3,>=2.8.0, but you have tensorflow 2.6.3 which is incompatible.
spacy 3.2.4 requires typing-extensions<4.0.0.0,>=3.7.4; python_version < "3.8", but you have typing-extensions 4.1.1 which is incompatible.
pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.5.1 which is incompatible.
imageio 2.16.1 requires numpy>=1.20.0, but you have numpy 1.19.5 which is incompatible.
featuretools 1.8.0 requires numpy>=1.21.0, but you have numpy 1.19.5 which is incompatible.
apache-beam 2.37.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.4 which is incompatible.
apache-beam 2.37.0 requires httplib2<0.20.0,>=0.8, but you have httplib2 0.20.4 which is incompatible.
apache-beam 2.37.0 requires pyarrow<7.0.0,>=0.15.1, but you have pyarrow 7.0.0 which is incompatible.
Successfully installed numpy-1.19.5
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting matplotlib==3.4.0
  Downloading matplotlib-3.4.0-cp37-cp37m-manylinux1_x86_64.whl (10.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.3/10.3 MB 21.5 MB/s eta 0:00:00
[?25hRequirement already satisfied: pyparsing>=2.2.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib==3.4.0) (3.0.7)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib==3.4.0) (1.4.0)
Requirement already satisfied: numpy>=1.16 in /opt/conda/lib/python3.7/site-packages (from matplotlib==3.4.0) (1.19.5)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.7/site-packages (from matplotlib==3.4.0) (2.8.2)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.7/site-packages (from matplotlib==3.4.0) (0.11.0)
Requirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.7/site-packages (from matplotlib==3.4.0) (9.0.1)
Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib==3.4.0) (4.1.1)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil>=2.7->matplotlib==3.4.0) (1.16.0)
Installing collected packages: matplotlib
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.5.1
    Uninstalling matplotlib-3.5.1:
      Successfully uninstalled matplotlib-3.5.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
beatrix-jupyterlab 3.1.7 requires google-cloud-bigquery-storage, which is not installed.
pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.4.0 which is incompatible.
Successfully installed matplotlib-3.4.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting scikit-learn==0.23.2
  Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.8/6.8 MB 14.8 MB/s eta 0:00:00
[?25hRequirement already satisfied: scipy>=0.19.1 in /opt/conda/lib/python3.7/site-packages (from scikit-learn==0.23.2) (1.7.3)
Requirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.7/site-packages (from scikit-learn==0.23.2) (1.0.1)
Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from scikit-learn==0.23.2) (1.19.5)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn==0.23.2) (3.1.0)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.4 requires scikit-learn>=1.0.0, but you have scikit-learn 0.23.2 which is incompatible.
pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.4.0 which is incompatible.
imbalanced-learn 0.9.0 requires scikit-learn>=1.0.1, but you have scikit-learn 0.23.2 which is incompatible.
hypertools 0.8.0 requires scikit-learn>=0.24, but you have scikit-learn 0.23.2 which is incompatible.
featuretools 1.8.0 requires numpy>=1.21.0, but you have numpy 1.19.5 which is incompatible.
Successfully installed scikit-learn-0.23.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting pycaret==2.3.5
  Downloading pycaret-2.3.5-py3-none-any.whl (288 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 288.6/288.6 KB 1.6 MB/s eta 0:00:00
[?25hRequirement already satisfied: ipywidgets in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (7.6.5)
Requirement already satisfied: pyLDAvis in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (3.2.2)
Requirement already satisfied: pandas in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (1.3.5)
Collecting scipy<=1.5.4
  Downloading scipy-1.5.4-cp37-cp37m-manylinux1_x86_64.whl (25.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25.9/25.9 MB 30.0 MB/s eta 0:00:00
[?25hRequirement already satisfied: nltk in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (3.2.4)
Requirement already satisfied: IPython in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (7.32.0)
Collecting spacy<2.4.0
  Downloading spacy-2.3.7-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.4/10.4 MB 41.9 MB/s eta 0:00:00
[?25hRequirement already satisfied: lightgbm>=2.3.1 in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (3.3.1)
Requirement already satisfied: yellowbrick>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (1.4)
Requirement already satisfied: textblob in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (0.17.1)
Requirement already satisfied: plotly>=4.4.1 in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (5.7.0)
Requirement already satisfied: scikit-plot in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (0.3.7)
Requirement already satisfied: umap-learn in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (0.5.2)
Collecting gensim<4.0.0
  Downloading gensim-3.8.3-cp37-cp37m-manylinux1_x86_64.whl (24.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.2/24.2 MB 28.5 MB/s eta 0:00:00
[?25hRequirement already satisfied: Boruta in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (0.3)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (3.4.0)
Requirement already satisfied: pandas-profiling>=2.8.0 in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (3.1.0)
Collecting imbalanced-learn==0.7.0
  Downloading imbalanced_learn-0.7.0-py3-none-any.whl (167 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 167.1/167.1 KB 14.8 MB/s eta 0:00:00
[?25hCollecting mlflow
  Downloading mlflow-1.25.1-py3-none-any.whl (16.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.8/16.8 MB 33.2 MB/s eta 0:00:00
[?25hCollecting pyod
  Downloading pyod-0.9.9.tar.gz (116 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.4/116.4 KB 10.8 MB/s eta 0:00:00
[?25h  Preparing metadata (setup.py) ... [?25ldone
[?25hRequirement already satisfied: kmodes>=0.10.1 in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (0.12.0)
Requirement already satisfied: wordcloud in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (1.8.1)
Requirement already satisfied: joblib in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (1.0.1)
Requirement already satisfied: numpy==1.19.5 in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (1.19.5)
Requirement already satisfied: seaborn in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (0.11.2)
Requirement already satisfied: mlxtend>=0.17.0 in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (0.19.0)
Requirement already satisfied: scikit-learn==0.23.2 in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (0.23.2)
Requirement already satisfied: cufflinks>=0.17.0 in /opt/conda/lib/python3.7/site-packages (from pycaret==2.3.5) (0.17.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn==0.23.2->pycaret==2.3.5) (3.1.0)
Requirement already satisfied: setuptools>=34.4.1 in /opt/conda/lib/python3.7/site-packages (from cufflinks>=0.17.0->pycaret==2.3.5) (59.8.0)
Requirement already satisfied: colorlover>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from cufflinks>=0.17.0->pycaret==2.3.5) (0.3.0)
Requirement already satisfied: six>=1.9.0 in /opt/conda/lib/python3.7/site-packages (from cufflinks>=0.17.0->pycaret==2.3.5) (1.16.0)
Requirement already satisfied: smart-open>=1.8.1 in /opt/conda/lib/python3.7/site-packages (from gensim<4.0.0->pycaret==2.3.5) (5.2.1)
Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret==2.3.5) (0.7.5)
Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret==2.3.5) (0.2.0)
Requirement already satisfied: traitlets>=4.2 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret==2.3.5) (5.1.1)
Requirement already satisfied: pexpect>4.3 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret==2.3.5) (4.8.0)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret==2.3.5) (3.0.27)
Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret==2.3.5) (2.11.2)
Requirement already satisfied: jedi>=0.16 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret==2.3.5) (0.18.1)
Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret==2.3.5) (5.1.1)
Requirement already satisfied: matplotlib-inline in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret==2.3.5) (0.1.3)
Requirement already satisfied: ipython-genutils~=0.2.0 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret==2.3.5) (0.2.0)
Requirement already satisfied: jupyterlab-widgets>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret==2.3.5) (1.0.2)
Requirement already satisfied: widgetsnbextension~=3.5.0 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret==2.3.5) (3.5.2)
Requirement already satisfied: nbformat>=4.2.0 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret==2.3.5) (5.2.0)
Requirement already satisfied: ipykernel>=4.5.1 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret==2.3.5) (6.9.2)
Requirement already satisfied: wheel in /opt/conda/lib/python3.7/site-packages (from lightgbm>=2.3.1->pycaret==2.3.5) (0.37.1)
Requirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret==2.3.5) (9.0.1)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret==2.3.5) (2.8.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret==2.3.5) (1.4.0)
Requirement already satisfied: pyparsing>=2.2.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret==2.3.5) (3.0.7)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret==2.3.5) (0.11.0)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas->pycaret==2.3.5) (2021.3)
Requirement already satisfied: visions[type_image_path]==0.7.4 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (0.7.4)
Requirement already satisfied: jinja2>=2.11.1 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (3.1.1)
Requirement already satisfied: phik>=0.11.1 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (0.12.0)
Requirement already satisfied: requests>=2.24.0 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (2.27.1)
Requirement already satisfied: pydantic>=1.8.1 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (1.8.2)
Requirement already satisfied: markupsafe~=2.0.1 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (2.0.1)
Requirement already satisfied: htmlmin>=0.1.12 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (0.1.12)
Requirement already satisfied: tqdm>=4.48.2 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (4.63.0)
Requirement already satisfied: multimethod>=1.4 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (1.4)
Requirement already satisfied: PyYAML>=5.0.0 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (6.0)
Requirement already satisfied: tangled-up-in-unicode==0.1.0 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (0.1.0)
Requirement already satisfied: missingno>=0.4.2 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret==2.3.5) (0.4.2)
Requirement already satisfied: attrs>=19.3.0 in /opt/conda/lib/python3.7/site-packages (from visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret==2.3.5) (21.4.0)
Requirement already satisfied: networkx>=2.4 in /opt/conda/lib/python3.7/site-packages (from visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret==2.3.5) (2.5)
Requirement already satisfied: imagehash in /opt/conda/lib/python3.7/site-packages (from visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret==2.3.5) (4.2.1)
Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/lib/python3.7/site-packages (from plotly>=4.4.1->pycaret==2.3.5) (8.0.1)
Collecting catalogue<1.1.0,>=0.0.7
  Downloading catalogue-1.0.0-py2.py3-none-any.whl (7.7 kB)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /opt/conda/lib/python3.7/site-packages (from spacy<2.4.0->pycaret==2.3.5) (0.9.1)
Collecting thinc<7.5.0,>=7.4.1
  Downloading thinc-7.4.5-cp37-cp37m-manylinux2014_x86_64.whl (1.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 49.3 MB/s eta 0:00:00
[?25hRequirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /opt/conda/lib/python3.7/site-packages (from spacy<2.4.0->pycaret==2.3.5) (1.0.6)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in /opt/conda/lib/python3.7/site-packages (from spacy<2.4.0->pycaret==2.3.5) (0.7.7)
Collecting plac<1.2.0,>=0.9.6
  Downloading plac-1.1.3-py2.py3-none-any.whl (20 kB)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from spacy<2.4.0->pycaret==2.3.5) (2.0.6)
Collecting srsly<1.1.0,>=1.0.2
  Downloading srsly-1.0.5-cp37-cp37m-manylinux2014_x86_64.whl (184 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 184.3/184.3 KB 16.8 MB/s eta 0:00:00
[?25hRequirement already satisfied: preshed<3.1.0,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from spacy<2.4.0->pycaret==2.3.5) (3.0.6)
Collecting yellowbrick>=1.0.1
  Downloading yellowbrick-1.3.post1-py3-none-any.whl (271 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 271.4/271.4 KB 22.2 MB/s eta 0:00:00
[?25hRequirement already satisfied: alembic in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (1.7.7)
Requirement already satisfied: Flask in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (2.1.1)
Collecting gunicorn
  Downloading gunicorn-20.1.0-py3-none-any.whl (79 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.5/79.5 KB 7.9 MB/s eta 0:00:00
[?25hRequirement already satisfied: importlib-metadata!=4.7.0,>=3.7.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (4.11.3)
Collecting querystring-parser
  Downloading querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)
Requirement already satisfied: docker>=4.0.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (5.0.3)
Requirement already satisfied: sqlparse>=0.3.1 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (0.4.2)
Requirement already satisfied: gitpython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (3.1.27)
Requirement already satisfied: protobuf>=3.7.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (3.19.4)
Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (21.3)
Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (8.0.4)
Collecting prometheus-flask-exporter
  Downloading prometheus_flask_exporter-0.20.1-py3-none-any.whl (18 kB)
Requirement already satisfied: entrypoints in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (0.4)
Collecting databricks-cli>=0.8.7
  Downloading databricks-cli-0.16.6.tar.gz (62 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.2/62.2 KB 5.8 MB/s eta 0:00:00
[?25h  Preparing metadata (setup.py) ... [?25ldone
[?25hRequirement already satisfied: cloudpickle in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (2.0.0)
Requirement already satisfied: sqlalchemy in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret==2.3.5) (1.4.32)
Requirement already satisfied: future in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret==2.3.5) (0.18.2)
Requirement already satisfied: funcy in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret==2.3.5) (1.17)
Requirement already satisfied: numexpr in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret==2.3.5) (2.8.1)
Requirement already satisfied: numba>=0.35 in /opt/conda/lib/python3.7/site-packages (from pyod->pycaret==2.3.5) (0.55.1)
Requirement already satisfied: statsmodels in /opt/conda/lib/python3.7/site-packages (from pyod->pycaret==2.3.5) (0.13.2)
Requirement already satisfied: pynndescent>=0.5 in /opt/conda/lib/python3.7/site-packages (from umap-learn->pycaret==2.3.5) (0.5.6)
Requirement already satisfied: pyjwt>=1.7.0 in /opt/conda/lib/python3.7/site-packages (from databricks-cli>=0.8.7->mlflow->pycaret==2.3.5) (2.3.0)
Requirement already satisfied: oauthlib>=3.1.0 in /opt/conda/lib/python3.7/site-packages (from databricks-cli>=0.8.7->mlflow->pycaret==2.3.5) (3.2.0)
Requirement already satisfied: tabulate>=0.7.7 in /opt/conda/lib/python3.7/site-packages (from databricks-cli>=0.8.7->mlflow->pycaret==2.3.5) (0.8.9)
Requirement already satisfied: websocket-client>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from docker>=4.0.0->mlflow->pycaret==2.3.5) (1.3.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.7/site-packages (from gitpython>=2.1.0->mlflow->pycaret==2.3.5) (4.1.1)
Requirement already satisfied: gitdb<5,>=4.0.1 in /opt/conda/lib/python3.7/site-packages (from gitpython>=2.1.0->mlflow->pycaret==2.3.5) (4.0.9)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata!=4.7.0,>=3.7.0->mlflow->pycaret==2.3.5) (3.7.0)
Requirement already satisfied: psutil in /opt/conda/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->pycaret==2.3.5) (5.9.0)
Requirement already satisfied: debugpy<2.0,>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->pycaret==2.3.5) (1.5.1)
Requirement already satisfied: tornado<7.0,>=4.2 in /opt/conda/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->pycaret==2.3.5) (6.1)
Requirement already satisfied: jupyter-client<8.0 in /opt/conda/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->pycaret==2.3.5) (7.1.2)
Requirement already satisfied: nest-asyncio in /opt/conda/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->pycaret==2.3.5) (1.5.4)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.16->IPython->pycaret==2.3.5) (0.8.3)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /opt/conda/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets->pycaret==2.3.5) (4.4.0)
Requirement already satisfied: jupyter-core in /opt/conda/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets->pycaret==2.3.5) (4.9.2)
Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in /opt/conda/lib/python3.7/site-packages (from numba>=0.35->pyod->pycaret==2.3.5) (0.38.0)
Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.7/site-packages (from pexpect>4.3->IPython->pycaret==2.3.5) (0.7.0)
Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->IPython->pycaret==2.3.5) (0.2.5)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret==2.3.5) (1.26.8)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret==2.3.5) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/conda/lib/python3.7/site-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret==2.3.5) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.24.0->pandas-profiling>=2.8.0->pycaret==2.3.5) (3.3)
Requirement already satisfied: notebook>=4.4.1 in /opt/conda/lib/python3.7/site-packages (from widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (6.4.10)
Requirement already satisfied: importlib-resources in /opt/conda/lib/python3.7/site-packages (from alembic->mlflow->pycaret==2.3.5) (5.4.0)
Requirement already satisfied: Mako in /opt/conda/lib/python3.7/site-packages (from alembic->mlflow->pycaret==2.3.5) (1.2.0)
Requirement already satisfied: greenlet!=0.4.17 in /opt/conda/lib/python3.7/site-packages (from sqlalchemy->mlflow->pycaret==2.3.5) (1.1.2)
Requirement already satisfied: Werkzeug>=2.0 in /opt/conda/lib/python3.7/site-packages (from Flask->mlflow->pycaret==2.3.5) (2.0.3)
Requirement already satisfied: itsdangerous>=2.0 in /opt/conda/lib/python3.7/site-packages (from Flask->mlflow->pycaret==2.3.5) (2.1.2)
Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.7/site-packages (from prometheus-flask-exporter->mlflow->pycaret==2.3.5) (0.13.1)
Requirement already satisfied: patsy>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from statsmodels->pyod->pycaret==2.3.5) (0.5.2)
Requirement already satisfied: smmap<6,>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from gitdb<5,>=4.0.1->gitpython>=2.1.0->mlflow->pycaret==2.3.5) (3.0.5)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->pycaret==2.3.5) (0.18.1)
Requirement already satisfied: pyzmq>=13 in /opt/conda/lib/python3.7/site-packages (from jupyter-client<8.0->ipykernel>=4.5.1->ipywidgets->pycaret==2.3.5) (22.3.0)
Requirement already satisfied: terminado>=0.8.3 in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (0.13.3)
Requirement already satisfied: argon2-cffi in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (21.3.0)
Requirement already satisfied: Send2Trash>=1.8.0 in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (1.8.0)
Requirement already satisfied: nbconvert>=5 in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (6.5.0)
Requirement already satisfied: PyWavelets in /opt/conda/lib/python3.7/site-packages (from imagehash->visions[type_image_path]==0.7.4->pandas-profiling>=2.8.0->pycaret==2.3.5) (1.3.0)
Requirement already satisfied: jupyterlab-pygments in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (0.1.2)
Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (0.8.4)
Requirement already satisfied: nbclient>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (0.5.13)
Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (0.7.1)
Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (1.5.0)
Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (4.1.0)
Requirement already satisfied: tinycss2 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (1.1.1)
Requirement already satisfied: beautifulsoup4 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (4.10.0)
Requirement already satisfied: argon2-cffi-bindings in /opt/conda/lib/python3.7/site-packages (from argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (21.2.0)
Requirement already satisfied: cffi>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (1.15.0)
Requirement already satisfied: soupsieve>1.2 in /opt/conda/lib/python3.7/site-packages (from beautifulsoup4->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (2.3.1)
Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (0.5.1)
Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret==2.3.5) (2.21)
Building wheels for collected packages: pyod, databricks-cli
  Building wheel for pyod (setup.py) ... [?25ldone
[?25h  Created wheel for pyod: filename=pyod-0.9.9-py3-none-any.whl size=139325 sha256=d0a36e8fd0573bc188a9e8f06c62d07c3ca5c12d5e2466310139677ba7731700
  Stored in directory: /root/.cache/pip/wheels/68/32/f0/0dc3050775e77b6661a116b70817b02b4305fa253269d6d998
  Building wheel for databricks-cli (setup.py) ... [?25ldone
[?25h  Created wheel for databricks-cli: filename=databricks_cli-0.16.6-py3-none-any.whl size=112631 sha256=3ae75cdf238ec349a4cf775a5fd33709acb1b17e8f1c1265eb3de4fe7c88fa22
  Stored in directory: /root/.cache/pip/wheels/96/c1/f8/d75a22e789ab6a4dff11f18338c3af4360189aa371295cc934
Successfully built pyod databricks-cli
Installing collected packages: srsly, plac, scipy, querystring-parser, gunicorn, gensim, catalogue, yellowbrick, thinc, imbalanced-learn, databricks-cli, spacy, pyod, prometheus-flask-exporter, mlflow, pycaret
  Attempting uninstall: srsly
    Found existing installation: srsly 2.4.2
    Uninstalling srsly-2.4.2:
      Successfully uninstalled srsly-2.4.2
  Attempting uninstall: scipy
    Found existing installation: scipy 1.7.3
    Uninstalling scipy-1.7.3:
      Successfully uninstalled scipy-1.7.3
  Attempting uninstall: gensim
    Found existing installation: gensim 4.0.1
    Uninstalling gensim-4.0.1:
      Successfully uninstalled gensim-4.0.1
  Attempting uninstall: catalogue
    Found existing installation: catalogue 2.0.7
    Uninstalling catalogue-2.0.7:
      Successfully uninstalled catalogue-2.0.7
  Attempting uninstall: yellowbrick
    Found existing installation: yellowbrick 1.4
    Uninstalling yellowbrick-1.4:
      Successfully uninstalled yellowbrick-1.4
  Attempting uninstall: thinc
    Found existing installation: thinc 8.0.15
    Uninstalling thinc-8.0.15:
      Successfully uninstalled thinc-8.0.15
  Attempting uninstall: imbalanced-learn
    Found existing installation: imbalanced-learn 0.9.0
    Uninstalling imbalanced-learn-0.9.0:
      Successfully uninstalled imbalanced-learn-0.9.0
  Attempting uninstall: spacy
    Found existing installation: spacy 3.2.4
    Uninstalling spacy-3.2.4:
      Successfully uninstalled spacy-3.2.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scattertext 0.1.6 requires gensim>=4.0.0, but you have gensim 3.8.3 which is incompatible.
pymc3 3.11.5 requires scipy<1.8.0,>=1.7.3, but you have scipy 1.5.4 which is incompatible.
pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.4.0 which is incompatible.
hypertools 0.8.0 requires scikit-learn>=0.24, but you have scikit-learn 0.23.2 which is incompatible.
featuretools 1.8.0 requires numpy>=1.21.0, but you have numpy 1.19.5 which is incompatible.
en-core-web-sm 3.2.0 requires spacy<3.3.0,>=3.2.0, but you have spacy 2.3.7 which is incompatible.
en-core-web-lg 3.2.0 requires spacy<3.3.0,>=3.2.0, but you have spacy 2.3.7 which is incompatible.
Successfully installed catalogue-1.0.0 databricks-cli-0.16.6 gensim-3.8.3 gunicorn-20.1.0 imbalanced-learn-0.7.0 mlflow-1.25.1 plac-1.1.3 prometheus-flask-exporter-0.20.1 pycaret-2.3.5 pyod-0.9.9 querystring-parser-1.2.4 scipy-1.5.4 spacy-2.3.7 srsly-1.0.5 thinc-7.4.5 yellowbrick-1.3.post1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv


테스트

  • 분류를 위한 필수 라이브러리를 가져왔다.
from pycaret.classification import *
import warnings
warnings.filterwarnings("ignore")
  • 데이터를 불러오는 코드다.
from pycaret.datasets import get_data
data = get_data('diabetes')

png

Airflow 데이터 파이프라인 구축 예제

개요

  • 이번에는 CSV-JSON으로 데이터를 변환하는 파이프라인을 구축하도록 한다.

Step 01. Dags 폴더 생성

  • 프로젝트 Root 하단에 Dags 폴더를 만든다.
    • dags 폴더를 확인한다.
$ ls
airflow.cfg  airflow.db  dags  logs  venv  webserver_config.py

Step 02. 가상의 데이터 생성

  • 이번 테스트에서 사용할 라이브러리가 없다면 우선 설치한다.
$ pip3 install faker pandas
  • faker 라이브러리를 활용하여 가상의 데이터를 생성한다. (파일 경로 : data/step01_writecsv.py)
from faker import Faker
import csv
output=open('data.csv','w')
fake=Faker()
header=['name','age','street','city','state','zip','lng','lat']
mywriter=csv.writer(output)
mywriter.writerow(header)
for r in range(1000):
    mywriter.writerow([fake.name(),fake.random_int(min=18, max=80, step=1), fake.street_address(), fake.city(),fake.state(),fake.zipcode(),fake.longitude(),fake.latitude()])
output.close()
  • 생성된 후, 파일을 확인하도록 한다.
evan@evan:/mnt/c/airflow-test/data$ ls
data.csv  step01_writecsv.py

Step 03. csv2json 파일 구축

  • 이번에는 CSV와 JSON 변환 파일을 구축하는 코드를 작성한다. (파일 경로 : dags/csv2json.py)\
  • 주요 목적 함수 csvToJson()의 역할은 data/data.csv 파일을 불러와서 fromAirflow.json 파일로 변경하는 것이다.
  • DAG는 csvToJson 함수를 하나의 작업으로 등록하는 과정을 담는다. 작업의 소유자, 시작일시, 실패 시 재시도 횟수, 재시도 지연시 시간을 지정한다.
  • print_starting >> csvJson 에서 >> 는 하류 설정 연산자라고 부른다. (동의어 비트 자리이동 연산자)
import datetime as dt
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

import pandas as pd

def csvToJson():
    df=pd.read_csv('data/data.csv')
    for i,r in df.iterrows():
        print(r['name'])
    df.to_json('fromAirflow.json',orient='records')

default_args = {
    'owner': 'evan',
    'start_date': dt.datetime(2020, 3, 18),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}

with DAG('MyCSVDAG',
         default_args=default_args,
         schedule_interval=timedelta(minutes=5),      # '0 * * * *',
         ) as dag:

    print_starting = BashOperator(task_id='starting',
                               bash_command='echo "I am reading the CSV now....."')
    
    csvJson = PythonOperator(task_id='convertCSVtoJson',
                                 python_callable=csvToJson)

print_starting >> csvJson

Step 04. Airflow Webserver 및 Scheduler 동시 실행

  • 이제 웹서버와 스케쥴러를 동시에 실행한다. (터미널을 2개 열어야 함에 주의한다.)
$ airflow webserver -p 8080
$ airflow scheduler
  • 이제 WebUI를 확인하면 정상적으로 작동하는 것을 확인할 수 있다.

Untitled

Scikit-Learn OneHot Encoding 다양한 적용 방법

강의 홍보

개요

  • One-Hot Encoding 개념에 대해 이해한다.
  • One-Hot Encoder 사용법을 익힌다.

One-Hot Encoding

  • One-Hot Encoding은 문자를 숫자로 변환하는 것이다.
  • 먼저 그림을 보면서 이해하도록 한다.

  • 머신러닝 알고리즘은 데이터가 모두 숫자인 것으로 이해하기 때문에 모두 변환해주어야 한다.

OnetHotEncoder

  • OneHotEncoder는 Scikit-Learn 라이브러리에 있는 클래스이다.
    • 자세한 내용은 링크를 참조한다.
  • 먼저 예시를 참조한다.
import sklearn
print("sklearn ver.", sklearn.__version__)
sklearn ver. 1.0.2
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit_transform(X).toarray()
array([[0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.]])
  • 예시 코드를 보면 위 그림과 결괏값이 다르게 나오는 걸 확인할 수 있다.
  • 보통 우리가 다루는 데이터는 pandas 데이터프레임이기 때문에, 입문자분들에게는 거리감이 느껴질 수 있다.
  • 그래서 pandas 데이터프레임 데이터를 가져와서 테스트를 해보았다.
from sklearn.preprocessing import OneHotEncoder
from seaborn import load_dataset

penguins = load_dataset('penguins')
ohe = OneHotEncoder()
transformed = ohe.fit_transform(penguins[['island']])
print(transformed.toarray())
print(ohe.categories_)
print(penguins['island'].unique())
[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 ...
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
[array(['Biscoe', 'Dream', 'Torgersen'], dtype=object)]
['Torgersen' 'Biscoe' 'Dream']
  • 이제 해당 코드를 기존 데이터프레임에 추가하도록 한다.
print(penguins.head())
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          NaN     NaN  
4       3450.0  Female  
penguins[ohe.categories_[0]] = transformed.toarray()
print(penguins.head())
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  Biscoe  Dream  Torgersen  
0       3750.0    Male     0.0    0.0        1.0  
1       3800.0  Female     0.0    0.0        1.0  
2       3250.0  Female     0.0    0.0        1.0  
3          NaN     NaN     0.0    0.0        1.0  
4       3450.0  Female     0.0    0.0        1.0  

만약 다중 문자열 컬럼을 한다면?

  • 위 예시는 변경하려는 컬럼이 1개일 때는 시의적절하게 사용할 수 있다.
  • 그러나, 보통 캐글이나 데이콘 같은 대회에서는 여러개의 문자열 컬럼을 변환시켜야 한다.
  • 물론, 프로그래밍 능력을 갖춘 분이라면, 반복문을 사용해서 처리할 수도 있다.
  • 그러나, sklearn.compose.make_column_transformer 클래스를 활용하면 보다 쉽게 처리할 수 있다.
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import make_column_transformer
from seaborn import load_dataset
import pandas as pd

penguins = load_dataset('penguins')
sample_cols = ['island', 'sex', 'bill_length_mm', 'species']
penguins = penguins[sample_cols]

# 결측치 제거 
penguins = penguins.dropna()
print(penguins.head())
print(penguins.info())
      island     sex  bill_length_mm species
0  Torgersen    Male            39.1  Adelie
1  Torgersen  Female            39.5  Adelie
2  Torgersen  Female            40.3  Adelie
4  Torgersen  Female            36.7  Adelie
5  Torgersen    Male            39.3  Adelie
<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   island          333 non-null    object 
 1   sex             333 non-null    object 
 2   bill_length_mm  333 non-null    float64
 3   species         333 non-null    object 
dtypes: float64(1), object(3)
memory usage: 13.0+ KB
None
categorical_cols = ['island', 'sex']
label_cols = ['species']

transformer = make_column_transformer(
    (OneHotEncoder(), categorical_cols),
    remainder='passthrough', 
    verbose_feature_names_out = False)

transformed = transformer.fit_transform(penguins)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
print(transformed_df.head())
  island_Biscoe island_Dream island_Torgersen sex_Female sex_Male  \
0           0.0          0.0              1.0        0.0      1.0   
1           0.0          0.0              1.0        1.0      0.0   
2           0.0          0.0              1.0        1.0      0.0   
3           0.0          0.0              1.0        1.0      0.0   
4           0.0          0.0              1.0        0.0      1.0   

  bill_length_mm species  
0           39.1  Adelie  
1           39.5  Adelie  
2           40.3  Adelie  
3           36.7  Adelie  
4           39.3  Adelie  

OrdinalEncoder 클래스와 같이 사용이 가능한가?

  • 이번에는 OrdinalEncoder 클래스와 같이 사용을 하도록 한다.
import pandas as pd
from seaborn import load_dataset

tips = load_dataset('tips')

# 결측치 제거 
tips = tips.dropna()
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 9.1 KB
None
  • 위 데이터에서 sex, day는 onehot encoding을 진행하고, smoker와 time은 ordinal encoding을 동시 진행해본다.
  • 또한, numeric features를 위해 스케일러도 진행했다.
  • 그 후, 새로운 데이터 프레임으로 변환하는 코드를 작성한다.
  • ColumnTransformer 메서드 적용 후, get_feature_names()를 얻기 위해서는 helper 함수가 필요하다.
    • 함수는 해당 링크에서 가져왔다.
import warnings
import sklearn
import pandas as pd
import numpy as np

def get_feature_names(column_transformer):
    """Get feature names from all transformers.
    Returns
    -------
    feature_names : list of strings
        Names of the features produced by transform.
    """
    # Remove the internal helper function
    #check_is_fitted(column_transformer)
    
    # Turn loopkup into function for better handling with pipeline later
    def get_names(trans):
        # >> Original get_feature_names() method
        if trans == 'drop' or (
                hasattr(column, '__len__') and not len(column)):
            return []
        if trans == 'passthrough':
            if hasattr(column_transformer, '_df_columns'):
                if ((not isinstance(column, slice))
                        and all(isinstance(col, str) for col in column)):
                    return column
                else:
                    return column_transformer._df_columns[column]
            else:
                indices = np.arange(column_transformer._n_features)
                return ['x%d' % i for i in indices[column]]
        if not hasattr(trans, 'get_feature_names'):
        # >>> Change: Return input column names if no method avaiable
            # Turn error into a warning
            warnings.warn("Transformer %s (type %s) does not "
                                 "provide get_feature_names. "
                                 "Will return input column names if available"
                                 % (str(name), type(trans).__name__))
            # For transformers without a get_features_names method, use the input
            # names to the column transformer
            if column is None:
                return []
            else:
                return [name + "__" + f for f in column]

        return [name + "__" + f for f in trans.get_feature_names()]
    
    ### Start of processing
    feature_names = []
    
    # Allow transformers to be pipelines. Pipeline steps are named differently, so preprocessing is needed
    if type(column_transformer) == sklearn.pipeline.Pipeline:
        l_transformers = [(name, trans, None, None) for step, name, trans in column_transformer._iter()]
    else:
        # For column transformers, follow the original method
        l_transformers = list(column_transformer._iter(fitted=True))
    
    
    for name, trans, column, _ in l_transformers: 
        if type(trans) == sklearn.pipeline.Pipeline:
            # Recursive call on pipeline
            _names = get_feature_names(trans)
            # if pipeline has no transformer that returns names
            if len(_names)==0:
                _names = [name + "__" + f for f in column]
            feature_names.extend(_names)
        else:
            feature_names.extend(get_names(trans))
    
    return feature_names
  • 이제 위 함수들을 적용해서 각 인코딩과 사용하지 않는 컬럼들을 하나로 합치는 코드를 작성해본다.
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

categorical_cols = ['sex', 'day']
ordinal_cols = ['smoker', 'time']
numeric_cols = ['total_bill']
keep_features = [x for x in tips.columns if x not in categorical_cols + ordinal_cols + numeric_cols]

tips2 = tips[categorical_cols + ordinal_cols + numeric_cols]

transformer = ColumnTransformer(
    [('StandardScaler', StandardScaler(), numeric_cols),
     ('OneHotEncoder', OneHotEncoder(), categorical_cols),
     ('OrdinalEncoder', OrdinalEncoder(), ordinal_cols)],
    remainder='passthrough', 
    verbose_feature_names_out = False)

transformed = transformer.fit_transform(tips2)
transformed_df = pd.DataFrame(transformed, columns=get_feature_names(transformer))
tip3 = pd.concat([tips[keep_features], transformed_df], axis = 1)
tip3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tip                         244 non-null    float64
 1   size                        244 non-null    int64  
 2   StandardScaler__total_bill  244 non-null    float64
 3   OneHotEncoder__x0_Female    244 non-null    float64
 4   OneHotEncoder__x0_Male      244 non-null    float64
 5   OneHotEncoder__x1_Fri       244 non-null    float64
 6   OneHotEncoder__x1_Sat       244 non-null    float64
 7   OneHotEncoder__x1_Sun       244 non-null    float64
 8   OneHotEncoder__x1_Thur      244 non-null    float64
 9   OrdinalEncoder__smoker      244 non-null    float64
 10  OrdinalEncoder__time        244 non-null    float64
dtypes: float64(10), int64(1)
memory usage: 22.9 KB


/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:38: UserWarning: Transformer StandardScaler (type StandardScaler) does not provide get_feature_names. Will return input column names if available
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:38: UserWarning: Transformer OrdinalEncoder (type OrdinalEncoder) does not provide get_feature_names. Will return input column names if available
  • 일단 임시로 작업을 하기는 했으나, 뭔가 깔끔해보이지는 않는다.
  • 만약 작업을 한다면, 한꺼번에 하지 말고, 각 단계별로 pipeline을 구성 후, 순차적으로 하는 것이 현재로써는 좀 더 “정신건강상 좋아보인다!”

Plot Tree 색상 변경

개요

  • skleran.tree.plot_tree의 색상을 바꿔보도록 한다.
  • matplotlib 객체지향의 구조를 알면 어렵지(?) 않게 바꿀 수 있다.
  • 간단하게 plot_tree 시각화를 구현해본다.
    • 언제나 예제로 희생당하는 iris 데이터에게 애도를 표한다.
  • 구글코랩에서 실행 시, 다음 코드를 실행하여 최신 라이브러리로 업그레이드 한다.
!pip install -U matplotlib
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (3.2.2)
Collecting matplotlib
  Downloading matplotlib-3.5.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)
     |████████████████████████████████| 11.2 MB 27.0 MB/s 
[?25hRequirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (1.4.0)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (2.8.2)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (1.21.5)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (7.1.2)
Requirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (3.0.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (0.11.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (21.3)
Collecting fonttools>=4.22.0
  Downloading fonttools-4.31.2-py3-none-any.whl (899 kB)
     |████████████████████████████████| 899 kB 50.5 MB/s 
[?25hRequirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from kiwisolver>=1.0.1->matplotlib) (3.10.0.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7->matplotlib) (1.15.0)
Installing collected packages: fonttools, matplotlib
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.2.2
    Uninstalling matplotlib-3.2.2:
      Successfully uninstalled matplotlib-3.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed fonttools-4.31.2 matplotlib-3.5.1
%matplotlib inline 

import sklearn
print(sklearn.__version__)
import matplotlib
print(matplotlib.__version__)

# 필수 라이브러리 불러오기
from sklearn.datasets import load_iris
from sklearn import tree 
import matplotlib.pyplot as plt

# 데이터 불러오기
iris = load_iris()
print(iris.data.shape, iris.target.shape)
print("feature names", iris.feature_names)
print("class names", iris.target_names)

# 모형 학습 및 plot_tree 그래프 구현
dt = tree.DecisionTreeClassifier(random_state=0)
dt.fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(10, 6))
ax = tree.plot_tree(dt, max_depth = 2, filled=True, feature_names = iris.feature_names, class_names = iris.target_names)
plt.show()
1.0.2
3.5.1
(150, 4) (150,)
feature names ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
class names ['setosa' 'versicolor' 'virginica']

png