Overview
- Scikit-Learn's Pipeline is powerful.
- It can also be used with PyCaret and Skorch.
- Let's try it out on Google Colab.
Installing the Required Libraries
- After installing pycaret, be sure to click "Restart runtime" before continuing.
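The log below corresponds to an install command along these lines (the exact command and pinned versions are assumptions; they will vary by environment):

!pip install pycaret
!pip install skorch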
Collecting pycaret
Downloading pycaret-2.3.5-py3-none-any.whl (288 kB)
...
Successfully installed Boruta-0.3 Mako-1.1.6 PyYAML-6.0 alembic-1.4.1 databricks-cli-0.16.2 docker-5.0.3 funcy-1.17 gitdb-4.0.9 gitpython-3.1.24 gunicorn-20.1.0 htmlmin-0.1.12 imagehash-4.2.1 imbalanced-learn-0.7.0 joblib-1.0.1 kmodes-0.11.1 lightgbm-3.3.1 mlflow-1.22.0 mlxtend-0.19.0 multimethod-1.6 pandas-profiling-3.1.0 phik-0.12.0 prometheus-flask-exporter-0.18.7 pyLDAvis-3.2.2 pycaret-2.3.5 pydantic-1.8.2 pynndescent-0.5.5 pyod-0.9.6 python-editor-1.0.4 querystring-parser-1.2.4 requests-2.26.0 scikit-learn-0.23.2 scikit-plot-0.3.7 scipy-1.5.4 smmap-5.0.0 tangled-up-in-unicode-0.1.0 umap-learn-0.5.2 visions-0.7.4 websocket-client-1.2.3
Requirement already satisfied: skorch in /usr/local/lib/python3.7/dist-packages (0.11.0)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.7/dist-packages (from skorch) (0.8.9)
Requirement already satisfied: scikit-learn>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from skorch) (0.23.2)
Requirement already satisfied: tqdm>=4.14.0 in /usr/local/lib/python3.7/dist-packages (from skorch) (4.62.3)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-packages (from skorch) (1.19.5)
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from skorch) (1.5.4)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.19.1->skorch) (1.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.19.1->skorch) (3.0.0)
from pycaret.datasets import get_data
data = get_data("electrical_grid")
|   | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stabf |
|---|------|------|------|------|----|----|----|----|----|----|----|----|-------|
| 0 | 2.959060 | 3.079885 | 8.381025 | 9.780754 | 3.763085 | -0.782604 | -1.257395 | -1.723086 | 0.650456 | 0.859578 | 0.887445 | 0.958034 | unstable |
| 1 | 9.304097 | 4.902524 | 3.047541 | 1.369357 | 5.067812 | -1.940058 | -1.872742 | -1.255012 | 0.413441 | 0.862414 | 0.562139 | 0.781760 | stable |
| 2 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.277210 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | unstable |
| 3 | 0.716415 | 7.669600 | 4.486641 | 2.340563 | 3.963791 | -1.027473 | -1.938944 | -0.997374 | 0.446209 | 0.976744 | 0.929381 | 0.362718 | unstable |
| 4 | 3.134112 | 7.608772 | 4.943759 | 9.857573 | 3.525811 | -1.125531 | -1.845975 | -0.554305 | 0.797110 | 0.455450 | 0.656947 | 0.820923 | unstable |
PyTorch Model
- The skorch library works together with PyTorch models.
- First, design a class that defines an MLP model.
import torch.nn as nn

class Net(nn.Module):
    """A simple MLP: 12 inputs -> two hidden layers -> 2-class softmax output."""

    def __init__(self, num_inputs=12, num_units_d1=200, num_units_d2=100):
        super(Net, self).__init__()
        self.dense0 = nn.Linear(num_inputs, num_units_d1)
        self.nonlin = nn.ReLU()
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(num_units_d1, num_units_d2)
        self.output = nn.Linear(num_units_d2, 2)
        # skorch's NeuralNetClassifier applies the log itself when using the
        # default NLLLoss, so returning probabilities is fine here
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.dropout(X)
        X = self.nonlin(self.dense1(X))
        X = self.softmax(self.output(X))
        return X
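As a quick sanity check (a hypothetical snippet, not part of the original notebook), the module can be run on a random batch to confirm the shapes:

import torch

net_module = Net()
probs = net_module(torch.randn(4, 12))  # batch of 4 samples with 12 features
print(probs.shape)  # torch.Size([4, 2]); each row sums to 1 because of the softmax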
Skorch Classifier
- Hook the NeuralNetClassifier class up to the PyTorch module class.
- The default optimizer, SGD, is used; to change it, refer to the skorch documentation (a variant is sketched after the code below).
- 5-fold cross-validation will be performed, but by PyCaret rather than skorch, which is why skorch's internal split is disabled with train_split=None.
- 80% of the data is used for training and the remaining 20% for validation.
from skorch import NeuralNetClassifier

net = NeuralNetClassifier(
    module=Net,
    max_epochs=30,
    lr=0.1,
    batch_size=32,
    train_split=None,  # disable skorch's internal split; PyCaret manages CV
)
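To swap the optimizer, skorch accepts any torch optimizer class through the optimizer argument, with optimizer hyperparameters routed via the optimizer__ prefix (a sketch only; this variant is not used in the rest of the post):

import torch.optim as optim

net_adam = NeuralNetClassifier(
    module=Net,
    optimizer=optim.Adam,          # replaces the SGD default
    optimizer__weight_decay=1e-4,  # extra optimizer kwargs use the optimizer__ prefix
    max_epochs=30,
    lr=0.01,
    batch_size=32,
    train_split=None,
)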
Training the Neural Network with PyCaret
- With the skorch NN model initialized, it can now be trained with PyCaret.
- PyCaret uses a pandas DataFrame as its main data object.
- skorch, however, expects arrays rather than DataFrames, so the pipeline needs a DataFrameTransformer() step in front of the network.
from skorch.helper import DataFrameTransformer
import numpy as np
from sklearn.pipeline import Pipeline

nn_pipe = Pipeline([
    ("transform", DataFrameTransformer()),  # DataFrame -> arrays for skorch
    ("net", net),
])
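Because the result is an ordinary scikit-learn estimator, the pipeline could also be fit directly, outside PyCaret (a minimal sketch; the integer target encoding here mirrors the stable: 0, unstable: 1 mapping PyCaret applies later):

X = data.drop(columns=["stabf"])
y = (data["stabf"] == "unstable").astype("int64").to_numpy()  # NLLLoss expects integer class labels
nn_pipe.fit(X, y)  # DataFrameTransformer feeds the float columns to Net as its X argument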
PyCaret Setup
- Instead of driving the model through the skorch API, we now use PyCaret.
- Setting log_experiment to True enables MLflow experiment tracking.
- Setting silent to True skips the interactive "press enter to continue" confirmation step.
from pycaret.classification import *
target = "stabf"
clf1 = setup(data = data,
target = target,
train_size = 0.8,
fold = 5,
session_id = 123,
log_experiment = True,
experiment_name = 'electrical_grid_1',
silent = True)
|   | Description | Value |
|---|-------------|-------|
| 0 | session_id | 123 |
| 1 | Target | stabf |
| 2 | Target Type | Binary |
| 3 | Label Encoded | stable: 0, unstable: 1 |
| 4 | Original Data | (10000, 13) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 12 |
| 7 | Categorical Features | 0 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |
| 10 | High Cardinality Method | None |
| 11 | Transformed Train Set | (8000, 12) |
| 12 | Transformed Test Set | (2000, 12) |
| 13 | Shuffle Train-Test | True |
| 14 | Stratify Train-Test | False |
| 15 | Fold Generator | StratifiedKFold |
| 16 | Fold Number | 5 |
| 17 | CPU Jobs | -1 |
| 18 | Use GPU | False |
| 19 | Log Experiment | True |
| 20 | Experiment Name | electrical_grid_1 |
| 21 | USI | 9626 |
| 22 | Imputation Type | simple |
| 23 | Iterative Imputation Iteration | None |
| 24 | Numeric Imputer | mean |
| 25 | Iterative Imputation Numeric Model | None |
| 26 | Categorical Imputer | constant |
| 27 | Iterative Imputation Categorical Model | None |
| 28 | Unknown Categoricals Handling | least_frequent |
| 29 | Normalize | False |
| 30 | Normalize Method | None |
| 31 | Transformation | False |
| 32 | Transformation Method | None |
| 33 | PCA | False |
| 34 | PCA Method | None |
| 35 | PCA Components | None |
| 36 | Ignore Low Variance | False |
| 37 | Combine Rare Levels | False |
| 38 | Rare Level Threshold | None |
| 39 | Numeric Binning | False |
| 40 | Remove Outliers | False |
| 41 | Outliers Threshold | None |
| 42 | Remove Multicollinearity | False |
| 43 | Multicollinearity Threshold | None |
| 44 | Remove Perfect Collinearity | True |
| 45 | Clustering | False |
| 46 | Clustering Iteration | None |
| 47 | Polynomial Features | False |
| 48 | Polynomial Degree | None |
| 49 | Trignometry Features | False |
| 50 | Polynomial Threshold | None |
| 51 | Group Features | False |
| 52 | Feature Selection | False |
| 53 | Feature Selection Method | classic |
| 54 | Features Selection Threshold | None |
| 55 | Feature Interaction | False |
| 56 | Feature Ratio | False |
| 57 | Interaction Threshold | None |
| 58 | Fix Imbalance | False |
| 59 | Fix Imbalance Method | SMOTE |
PyCaret Train Model
- First, train a Random Forest classifier as a baseline.
model = create_model("rf")
| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.9244 | 0.9796 | 0.9667 | 0.9189 | 0.9422 | 0.8331 | 0.8353 |
| 1 | 0.9275 | 0.9793 | 0.9549 | 0.9330 | 0.9438 | 0.8417 | 0.8422 |
| 2 | 0.9225 | 0.9810 | 0.9608 | 0.9211 | 0.9406 | 0.8294 | 0.8309 |
| 3 | 0.9081 | 0.9738 | 0.9461 | 0.9130 | 0.9293 | 0.7983 | 0.7993 |
| 4 | 0.9044 | 0.9738 | 0.9471 | 0.9071 | 0.9267 | 0.7894 | 0.7909 |
| Mean | 0.9174 | 0.9775 | 0.9551 | 0.9186 | 0.9365 | 0.8184 | 0.8197 |
| SD | 0.0093 | 0.0031 | 0.0079 | 0.0087 | 0.0071 | 0.0206 | 0.0206 |
PyCaret Train Skorch Model
- This time, pass the skorch pipeline into the same PyCaret function and check the results.
skorch_model = create_model(nn_pipe)
| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.8831 | 0.9644 | 0.9500 | 0.8769 | 0.9120 | 0.7389 | 0.7441 |
| 1 | 0.8550 | 0.9437 | 0.9569 | 0.8385 | 0.8938 | 0.6685 | 0.6831 |
| 2 | 0.8369 | 0.9280 | 0.9638 | 0.8146 | 0.8829 | 0.6202 | 0.6446 |
| 3 | 0.8506 | 0.9347 | 0.8668 | 0.8957 | 0.8810 | 0.6805 | 0.6812 |
| 4 | 0.8081 | 0.9411 | 0.9765 | 0.7789 | 0.8666 | 0.5400 | 0.5859 |
| Mean | 0.8468 | 0.9424 | 0.9428 | 0.8409 | 0.8873 | 0.6496 | 0.6678 |
| SD | 0.0245 | 0.0123 | 0.0390 | 0.0421 | 0.0151 | 0.0666 | 0.0519 |
Comparing Models
- Let's check which of the two models performs better.
best_model = compare_models(include=[skorch_model, model], sort = "AUC")
|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|-------|----------|-----|--------|-------|----|-------|-----|----------|
| 1 | Random Forest Classifier | 0.9174 | 0.9775 | 0.9551 | 0.9186 | 0.9365 | 0.8184 | 0.8197 | 2.114 |
| 0 | NeuralNetClassifier | 0.8426 | 0.9400 | 0.9547 | 0.8281 | 0.8861 | 0.6355 | 0.6565 | 11.878 |
Hyperparameter Grid
- Next, apply hyperparameter tuning.
- The tunable parameter names can be listed with the following command:
skorch_model.get_params().keys()
dict_keys(['memory', 'steps', 'verbose', 'transform', 'net', 'transform__float_dtype', 'transform__int_dtype', 'transform__treat_int_as_categorical', 'net__module', 'net__criterion', 'net__optimizer', 'net__lr', 'net__max_epochs', 'net__batch_size', 'net__iterator_train', 'net__iterator_valid', 'net__dataset', 'net__train_split', 'net__callbacks', 'net__predict_nonlinearity', 'net__warm_start', 'net__verbose', 'net__device', 'net___kwargs', 'net__classes', 'net__callbacks__epoch_timer', 'net__callbacks__train_loss', 'net__callbacks__train_loss__name', 'net__callbacks__train_loss__lower_is_better', 'net__callbacks__train_loss__on_train', 'net__callbacks__valid_loss', 'net__callbacks__valid_loss__name', 'net__callbacks__valid_loss__lower_is_better', 'net__callbacks__valid_loss__on_train', 'net__callbacks__valid_acc', 'net__callbacks__valid_acc__scoring', 'net__callbacks__valid_acc__lower_is_better', 'net__callbacks__valid_acc__on_train', 'net__callbacks__valid_acc__name', 'net__callbacks__valid_acc__target_extractor', 'net__callbacks__valid_acc__use_caching', 'net__callbacks__print_log', 'net__callbacks__print_log__keys_ignored', 'net__callbacks__print_log__sink', 'net__callbacks__print_log__tablefmt', 'net__callbacks__print_log__floatfmt', 'net__callbacks__print_log__stralign'])
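The second listing below shows the same keys without the net__ prefix; it presumably comes from inspecting the classifier step on its own, e.g. net.get_params().keys():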
dict_keys(['module', 'criterion', 'optimizer', 'lr', 'max_epochs', 'batch_size', 'iterator_train', 'iterator_valid', 'dataset', 'train_split', 'callbacks', 'predict_nonlinearity', 'warm_start', 'verbose', 'device', '_kwargs', 'classes', 'callbacks__epoch_timer', 'callbacks__train_loss', 'callbacks__train_loss__name', 'callbacks__train_loss__lower_is_better', 'callbacks__train_loss__on_train', 'callbacks__valid_loss', 'callbacks__valid_loss__name', 'callbacks__valid_loss__lower_is_better', 'callbacks__valid_loss__on_train', 'callbacks__valid_acc', 'callbacks__valid_acc__scoring', 'callbacks__valid_acc__lower_is_better', 'callbacks__valid_acc__on_train', 'callbacks__valid_acc__name', 'callbacks__valid_acc__target_extractor', 'callbacks__valid_acc__use_caching', 'callbacks__print_log', 'callbacks__print_log__keys_ignored', 'callbacks__print_log__sink', 'callbacks__print_log__tablefmt', 'callbacks__print_log__floatfmt', 'callbacks__print_log__stralign'])
import torch.optim as optim

custom_grid = {
    'net__max_epochs': [20, 30],
    'net__lr': [0.01, 0.05, 0.1],
    'net__module__num_units_d1': [50, 100, 150],
    'net__module__num_units_d2': [50, 100, 150],
    'net__optimizer': [optim.Adam, optim.SGD, optim.RMSprop]
}
- Now pass this grid to tune_model; the double-underscore names (net__..., net__module__...) route each value through the pipeline's net step and into the Net constructor arguments.
tuned_skorch_model = tune_model(skorch_model, custom_grid = custom_grid)
| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.8762 | 0.9667 | 0.9686 | 0.8562 | 0.9089 | 0.7182 | 0.7316 |
| 1 | 0.8675 | 0.9477 | 0.8784 | 0.9106 | 0.8942 | 0.7171 | 0.7179 |
| 2 | 0.8375 | 0.9452 | 0.7835 | 0.9535 | 0.8602 | 0.6706 | 0.6891 |
| 3 | 0.8575 | 0.9522 | 0.8208 | 0.9490 | 0.8803 | 0.7066 | 0.7180 |
| 4 | 0.7975 | 0.9315 | 0.9726 | 0.7704 | 0.8597 | 0.5127 | 0.5602 |
| Mean | 0.8472 | 0.9487 | 0.8848 | 0.8879 | 0.8807 | 0.6650 | 0.6834 |
| SD | 0.0280 | 0.0114 | 0.0763 | 0.0684 | 0.0192 | 0.0781 | 0.0631 |
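From here the usual PyCaret workflow applies; the snippet below is a sketch of typical next steps, not part of the original run:

# score the tuned pipeline on the 20% hold-out set created by setup()
predict_model(tuned_skorch_model)

# refit on the full dataset and persist the pipeline for reuse
final_model = finalize_model(tuned_skorch_model)
save_model(final_model, 'electrical_grid_skorch')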
References
- https://pycaret.org/
- https://www.analyticsvidhya.com/blog/2020/05/pycaret-machine-learning-model-seconds/
- https://github.com/skorch-dev/skorch
- https://towardsdatascience.com/skorch-pytorch-models-trained-with-a-scikit-learn-wrapper-62b9a154623e
- https://towardsdatascience.com/pycaret-skorch-build-pytorch-neural-networks-using-minimal-code-57079e197f33