
Overview

Install NiFi on Windows.
NiFi requires Java, so Java must be installed first.

Step 01. Download NiFi

First, open the download page.
- URL : https://www.apache.org/dyn/closer.lua?path=/nifi/1.16.0/nifi-1.16.0-bin.zip

Click the first link on the page.
- URL : https://dlcdn.apache.org/nifi/1.16.0/nifi-1.16.0-bin.zip
Extract the downloaded archive.

Step 02. Set up Java

For the Java installation, refer to the blog post below.
Reference : https://maktony.tistory.com/13

Step 03. Run the run-nifi batch file

Run the run-nifi batch file as administrator.

If the console prints the NiFi startup message, the launch succeeded.

Step 04. Check the Web UI

After about one minute, open the Web UI.
- https://127.0.0.1:8443/nifi/login
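
Because NiFi needs some time to start, one option is to poll the HTTPS port until it accepts connections instead of guessing how long to wait. The following is a minimal sketch using only the Python standard library. The helper names `is_port_open` and `wait_for_port` and the timeout values are illustrative assumptions, not part of NiFi.

```python
import socket
import time


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, total_timeout=120.0, interval=2.0):
    """Poll host:port until it accepts connections or total_timeout elapses."""
    deadline = time.monotonic() + total_timeout
    while time.monotonic() < deadline:
        if is_port_open(host, port):
            return True
        time.sleep(interval)
    return False
```

Calling `wait_for_port("127.0.0.1", 8443)` returns `True` once the port used in this post is reachable. An open port only shows that something is listening, so the login page should still be checked in the browser.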

Overview

Install Airflow on Windows WSL2.

Step 1. Install pip on WSL

Install pip, which is needed to install Airflow.

$ sudo apt install python3-pip
[sudo] password for username:

Step 2. Install virtualenv package

Install the virtualenv package.

$ sudo pip3 install virtualenv

Step 3. Create a virtual environment

Create an airflow-test folder on the C drive.
- Move into that directory.
Then create a virtual environment.

$ virtualenv venv

Activate the virtual environment.

$ source venv/bin/activate

Next, edit the .bashrc file.

$ vi ~/.bashrc

Open the file and add the following line.

export AIRFLOW_HOME=/mnt/c/airflow-test

To save and close the file, press ESC and then type :wq.
Apply the change to the current shell as follows.

$ source ~/.bashrc

To confirm that the variable is set, print it as follows.

$ echo $AIRFLOW_HOME
/mnt/c/airflow-test
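
The value `/mnt/c/airflow-test` is how the `airflow-test` folder on the Windows C drive appears inside WSL, because WSL mounts Windows drives under `/mnt/<drive letter>` by default. The mapping can be sketched in Python as below. The function name `windows_to_wsl` is a hypothetical helper for illustration, and the sketch assumes the default mount location has not been changed.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl(path):
    """Map a Windows path such as C:\\airflow-test to its default WSL mount path."""
    win = PureWindowsPath(path)
    drive = win.drive.rstrip(":")
    if len(drive) != 1 or not drive.isalpha():
        raise ValueError(f"expected a path with a drive letter, got {path!r}")
    # WSL mounts each Windows drive under /mnt/<lowercase letter> by default.
    return str(PurePosixPath("/mnt", drive.lower(), *win.parts[1:]))


print(windows_to_wsl(r"C:\airflow-test"))  # /mnt/c/airflow-test
```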

Step 4. Install Apache Airflow

Install Airflow together with the PostgreSQL, Slack, and Celery extras.

$ pip3 install 'apache-airflow[postgres, slack, celery]'

Before running Airflow, initialize the database.

$ airflow db init

To check that the installation works, start the webserver.

$ airflow webserver -p 8081

Next, a Scheduler is required for data workflows to run on a regular schedule.

$ airflow scheduler

Then open http://localhost:8081/login/ to see the Airflow login page.

Overview

The main references are as follows.
- WSL2 installation : https://www.lainyzine.com/ko/article/how-to-install-wsl2-and-use-linux-on-windows-10/#google_vignette
- Docker installation : https://www.lainyzine.com/ko/article/a-complete-guide-to-how-to-install-docker-desktop-on-windows-10/

Step 1. Install WSL2

Run Windows PowerShell as administrator and enter the following commands.

$ dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
$ dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

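A reboot is required after running the commands above.
Download the latest WSL2 Linux kernel update package for x64 machines and install it by following the prompts.
Open Windows PowerShell and run the command below.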

$ wsl --set-default-version 2
For information on key differences with WSL 2 please visit https://aka.ms/wsl2

Step 2. Install Docker Desktop

Go to the following page and download Docker Desktop for Windows.

Docker Desktop for Mac and Windows | Docker

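Overview

Understand the concept of One-Hot Encoding.
Learn how to use OneHotEncoder.

One-Hot Encoding

One-Hot Encoding converts categorical (string) values into numbers: each category becomes its own column, which holds 1 when a row belongs to that category and 0 otherwise.
A figure makes the idea easier to grasp.

Machine learning algorithms expect every input to be numeric, so string columns must be converted.

OneHotEncoder

OneHotEncoder is a class in the Scikit-Learn library.
- See the linked documentation for details.
Start with a simple example.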

import sklearn
print("sklearn ver.", sklearn.__version__)

sklearn ver. 1.0.2

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit_transform(X).toarray()

array([[0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.]])
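
To read the array above, it helps to know that the encoder builds one block of columns per input column and orders the categories within each block. The pure-Python sketch below reproduces that layout by hand for the same `X`. It is an illustration of the idea, not scikit-learn's implementation, and `one_hot_rows` is a hypothetical helper name.

```python
def one_hot_rows(rows):
    """One-hot encode a list of rows, column by column, with sorted categories."""
    n_cols = len(rows[0])
    # Collect the sorted unique values of each input column.
    categories = [sorted({row[i] for row in rows}) for i in range(n_cols)]
    encoded = []
    for row in rows:
        out = []
        for i, value in enumerate(row):
            # One output column per category: 1.0 where the value matches, else 0.0.
            out.extend(1.0 if value == cat else 0.0 for cat in categories[i])
        encoded.append(out)
    return categories, encoded


X = [['Male', 1], ['Female', 3], ['Female', 2]]
categories, encoded = one_hot_rows(X)
print(categories)  # [['Female', 'Male'], [1, 2, 3]]
for row in encoded:
    print(row)
# [0.0, 1.0, 1.0, 0.0, 0.0]
# [1.0, 0.0, 0.0, 0.0, 1.0]
# [1.0, 0.0, 0.0, 1.0, 0.0]
```

The first two output columns therefore correspond to Female and Male, and the last three to the values 1, 2, and 3.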

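The output of the example code looks different from the figure above.
The data we usually work with is a pandas DataFrame, so this list-based example can feel unfamiliar to beginners.
So I tested it with data loaded into a pandas DataFrame.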

from sklearn.preprocessing import OneHotEncoder
from seaborn import load_dataset

penguins = load_dataset('penguins')
ohe = OneHotEncoder()
transformed = ohe.fit_transform(penguins[['island']])
print(transformed.toarray())
print(ohe.categories_)
print(penguins['island'].unique())

[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 ...
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
[array(['Biscoe', 'Dream', 'Torgersen'], dtype=object)]
['Torgersen' 'Biscoe' 'Dream']

Now add the encoded columns to the original DataFrame.

print(penguins.head())

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          NaN     NaN  
4       3450.0  Female

penguins[ohe.categories_[0]] = transformed.toarray()
print(penguins.head())

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  Biscoe  Dream  Torgersen  
0       3750.0    Male     0.0    0.0        1.0  
1       3800.0  Female     0.0    0.0        1.0  
2       3250.0  Female     0.0    0.0        1.0  
3          NaN     NaN     0.0    0.0        1.0  
4       3450.0  Female     0.0    0.0        1.0

What about multiple string columns?

The example above works well when only one column needs to be converted.
However, in competitions such as Kaggle or Dacon, several string columns usually have to be converted.
Of course, anyone comfortable with programming can handle this with a loop.
However, the sklearn.compose.make_column_transformer function makes this easier.
- Reference : https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import make_column_transformer
from seaborn import load_dataset
import pandas as pd

penguins = load_dataset('penguins')
sample_cols = ['island', 'sex', 'bill_length_mm', 'species']
penguins = penguins[sample_cols]

# Drop rows with missing values
penguins = penguins.dropna()
print(penguins.head())
print(penguins.info())

      island     sex  bill_length_mm species
0  Torgersen    Male            39.1  Adelie
1  Torgersen  Female            39.5  Adelie
2  Torgersen  Female            40.3  Adelie
4  Torgersen  Female            36.7  Adelie
5  Torgersen    Male            39.3  Adelie
<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   island          333 non-null    object 
 1   sex             333 non-null    object 
 2   bill_length_mm  333 non-null    float64
 3   species         333 non-null    object 
dtypes: float64(1), object(3)
memory usage: 13.0+ KB
None

categorical_cols = ['island', 'sex']
label_cols = ['species']

transformer = make_column_transformer(
    (OneHotEncoder(), categorical_cols),
    remainder='passthrough', 
    verbose_feature_names_out = False)

transformed = transformer.fit_transform(penguins)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
print(transformed_df.head())

  island_Biscoe island_Dream island_Torgersen sex_Female sex_Male  \
0           0.0          0.0              1.0        0.0      1.0   
1           0.0          0.0              1.0        1.0      0.0   
2           0.0          0.0              1.0        1.0      0.0   
3           0.0          0.0              1.0        1.0      0.0   
4           0.0          0.0              1.0        0.0      1.0   

  bill_length_mm species  
0           39.1  Adelie  
1           39.5  Adelie  
2           40.3  Adelie  
3           36.7  Adelie  
4           39.3  Adelie

Can it be used together with the OrdinalEncoder class?

This time, use it together with the OrdinalEncoder class.

import pandas as pd
from seaborn import load_dataset

tips = load_dataset('tips')

# Drop rows with missing values
tips = tips.dropna()
print(tips.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 9.1 KB
None

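In the data above, sex and day are one-hot encoded while smoker and time are ordinal encoded, all in one step.
A scaler is also applied to the numeric feature.
The result is then converted into a new DataFrame.
After applying ColumnTransformer, a helper function is needed to obtain the feature names via get_feature_names().
- The function was taken from the linked source.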

import warnings
import sklearn
import pandas as pd
import numpy as np

def get_feature_names(column_transformer):
    """Get feature names from all transformers.
    Returns
    -------
    feature_names : list of strings
        Names of the features produced by transform.
    """
    # Remove the internal helper function
    #check_is_fitted(column_transformer)
    
    # Turn lookup into function for better handling with pipeline later
    def get_names(trans):
        # >> Original get_feature_names() method
        if trans == 'drop' or (
                hasattr(column, '__len__') and not len(column)):
            return []
        if trans == 'passthrough':
            if hasattr(column_transformer, '_df_columns'):
                if ((not isinstance(column, slice))
                        and all(isinstance(col, str) for col in column)):
                    return column
                else:
                    return column_transformer._df_columns[column]
            else:
                indices = np.arange(column_transformer._n_features)
                return ['x%d' % i for i in indices[column]]
        if not hasattr(trans, 'get_feature_names'):
            # >>> Change: Return input column names if no method available
            # Turn error into a warning
            warnings.warn("Transformer %s (type %s) does not "
                                 "provide get_feature_names. "
                                 "Will return input column names if available"
                                 % (str(name), type(trans).__name__))
            # For transformers without a get_features_names method, use the input
            # names to the column transformer
            if column is None:
                return []
            else:
                return [name + "__" + f for f in column]

        return [name + "__" + f for f in trans.get_feature_names()]
    
    ### Start of processing
    feature_names = []
    
    # Allow transformers to be pipelines. Pipeline steps are named differently, so preprocessing is needed
    if type(column_transformer) == sklearn.pipeline.Pipeline:
        l_transformers = [(name, trans, None, None) for step, name, trans in column_transformer._iter()]
    else:
        # For column transformers, follow the original method
        l_transformers = list(column_transformer._iter(fitted=True))
    
    
    for name, trans, column, _ in l_transformers: 
        if type(trans) == sklearn.pipeline.Pipeline:
            # Recursive call on pipeline
            _names = get_feature_names(trans)
            # if pipeline has no transformer that returns names
            if len(_names)==0:
                _names = [name + "__" + f for f in column]
            feature_names.extend(_names)
        else:
            feature_names.extend(get_names(trans))
    
    return feature_names

Now apply the function above and write code that combines each encoder's output with the unused columns.

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

categorical_cols = ['sex', 'day']
ordinal_cols = ['smoker', 'time']
numeric_cols = ['total_bill']
keep_features = [x for x in tips.columns if x not in categorical_cols + ordinal_cols + numeric_cols]

tips2 = tips[categorical_cols + ordinal_cols + numeric_cols]

transformer = ColumnTransformer(
    [('StandardScaler', StandardScaler(), numeric_cols),
     ('OneHotEncoder', OneHotEncoder(), categorical_cols),
     ('OrdinalEncoder', OrdinalEncoder(), ordinal_cols)],
    remainder='passthrough', 
    verbose_feature_names_out = False)

transformed = transformer.fit_transform(tips2)
transformed_df = pd.DataFrame(transformed, columns=get_feature_names(transformer))
tip3 = pd.concat([tips[keep_features], transformed_df], axis = 1)
tip3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tip                         244 non-null    float64
 1   size                        244 non-null    int64  
 2   StandardScaler__total_bill  244 non-null    float64
 3   OneHotEncoder__x0_Female    244 non-null    float64
 4   OneHotEncoder__x0_Male      244 non-null    float64
 5   OneHotEncoder__x1_Fri       244 non-null    float64
 6   OneHotEncoder__x1_Sat       244 non-null    float64
 7   OneHotEncoder__x1_Sun       244 non-null    float64
 8   OneHotEncoder__x1_Thur      244 non-null    float64
 9   OrdinalEncoder__smoker      244 non-null    float64
 10  OrdinalEncoder__time        244 non-null    float64
dtypes: float64(10), int64(1)
memory usage: 22.9 KB


/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:38: UserWarning: Transformer StandardScaler (type StandardScaler) does not provide get_feature_names. Will return input column names if available
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:38: UserWarning: Transformer OrdinalEncoder (type OrdinalEncoder) does not provide get_feature_names. Will return input column names if available

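This works as a stopgap, but it does not look very clean.
For real work, rather than doing everything at once, building a pipeline for each step and applying the steps in sequence currently seems "better for one's sanity!"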

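Overview

Change the colors of sklearn.tree.plot_tree.
Once you understand matplotlib's object-oriented structure, this is not (too) hard.
Start with a simple plot_tree visualization.
- Condolences to the iris dataset, forever sacrificed as an example.
When running on Google Colab, run the following command to upgrade the library to the latest version.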

!pip install -U matplotlib

Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (3.2.2)
Collecting matplotlib
  Downloading matplotlib-3.5.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)
     |████████████████████████████████| 11.2 MB 27.0 MB/s 
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (1.4.0)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (2.8.2)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (1.21.5)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (7.1.2)
Requirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (3.0.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (0.11.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (21.3)
Collecting fonttools>=4.22.0
  Downloading fonttools-4.31.2-py3-none-any.whl (899 kB)
     |████████████████████████████████| 899 kB 50.5 MB/s 
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from kiwisolver>=1.0.1->matplotlib) (3.10.0.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7->matplotlib) (1.15.0)
Installing collected packages: fonttools, matplotlib
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.2.2
    Uninstalling matplotlib-3.2.2:
      Successfully uninstalled matplotlib-3.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed fonttools-4.31.2 matplotlib-3.5.1

%matplotlib inline 

import sklearn
print(sklearn.__version__)
import matplotlib
print(matplotlib.__version__)

# Import the required libraries
from sklearn.datasets import load_iris
from sklearn import tree 
import matplotlib.pyplot as plt

# Load the data
iris = load_iris()
print(iris.data.shape, iris.target.shape)
print("feature names", iris.feature_names)
print("class names", iris.target_names)

# Train the model and draw the plot_tree graph
dt = tree.DecisionTreeClassifier(random_state=0)
dt.fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(10, 6))
# plot_tree returns a list of annotation artists, so draw onto ax instead of overwriting it
tree.plot_tree(dt, max_depth = 2, filled=True, feature_names = iris.feature_names, class_names = iris.target_names, ax=ax)
plt.show()

1.0.2
3.5.1
(150, 4) (150,)
feature names ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
class names ['setosa' 'versicolor' 'virginica']
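
One way to recolor the tree is sketched below. It is an assumption about how the change could be made, not necessarily the approach this post had in mind. plot_tree returns the matplotlib Annotation objects it draws, and each annotation's box can be restyled through `get_bbox_patch()`. The colors are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
dt = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(10, 6))
annotations = tree.plot_tree(dt, max_depth=2, filled=True,
                             feature_names=list(iris.feature_names),
                             class_names=list(iris.target_names), ax=ax)

# Each node is a matplotlib Annotation; its box is reachable via get_bbox_patch().
for ann in annotations:
    box = ann.get_bbox_patch()
    if box is not None:
        box.set_facecolor("lightyellow")  # hypothetical color choice
        box.set_edgecolor("navy")

fig.canvas.draw()  # render once so the new colors are applied to the figure
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped and `plt.show()` used in place of `fig.canvas.draw()`.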

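Step 01 - Open BigKinds and download the data

Site : https://www.bigkinds.or.kr/v2/news/index.do
Enter a keyword on the site.
- The period, newspapers, and other filters can be selected here.
- I used the keyword '사회적 경제' (social economy) and selected Kookmin Ilbo (국민일보), Chosun Ilbo (조선일보), and JoongAng Ilbo (중앙일보) as the newspapers.
- Scroll to the bottom and click the Apply button.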

Next, click the Analysis Results and Visualization tab.
- Click the Excel download button at the bottom of the Data Download tab.

The file includes the article body, but usually only as a short summary of around 200 characters.

Step 02 - Preparation before writing the web crawling code

First, load the file downloaded earlier.
Then select only the columns that are needed.

> library(dplyr)
> library(readxl)
> raw_df = read_excel("data/NewsResult_20211213-20220313.xlsx", sheet = 1)
> raw_df2 = raw_df %>% select(일자, 언론사, 제목, URL)
> raw_df2 %>% group_by(언론사) %>% summarise(n = n())
# A tibble: 3 × 2
  언론사       n
  <chr>    <int>
1 국민일보   180
2 조선일보   115
3 중앙일보   256

Store each newspaper's rows in a separate object. The code below extracts only Kookmin Ilbo (국민일보) as an example.

> kmib_df = raw_df2 %>% filter(언론사 == "국민일보")
> head(kmib_df, 3)
# A tibble: 3 × 4
  일자     언론사   제목                                         URL
  <chr>    <chr>    <chr>                                        <chr>
1 20220312 국민일보 팬데믹의 비극 무너지는 가정, 스러지는 아이들 http://news.kmib.co…
2 20220312 국민일보 ‘장로’ 디딤돌인가 걸림돌인가                 http://news.kmib.co…
3 20220311 국민일보 [기고]대전은 우리가 지킨다                   http://news.kmib.co…

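Next, extract only the URLs and check for anything unusual.
- Overall, the addresses look similar.
- Some addresses carry an extra parameter such as &cp=kd.
- It is worth checking whether the article body sits in the same place on both kinds of pages. (Checked: there is no difference!)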

> kmib_df$URL[1:5]
[1] "http://news.kmib.co.kr/article/view.asp?arcid=0924235144&code=11131100"      
[2] "http://news.kmib.co.kr/article/view.asp?arcid=0924235120&code=23111111"      
[3] "http://news.kmib.co.kr/article/view.asp?arcid=0016856942&code=61221514&cp=kd"
[4] "http://news.kmib.co.kr/article/view.asp?arcid=0016853803&code=61111711&cp=kd"
[5] "http://news.kmib.co.kr/article/view.asp?arcid=0016847353&code=61131111&cp=kd"
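
The difference between the two kinds of addresses is easier to see once the query string is split into parameters. The sketch below does this in Python (rather than R) with the standard library, using three of the URLs printed above. `query_params` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit

urls = [
    "http://news.kmib.co.kr/article/view.asp?arcid=0924235144&code=11131100",
    "http://news.kmib.co.kr/article/view.asp?arcid=0924235120&code=23111111",
    "http://news.kmib.co.kr/article/view.asp?arcid=0016856942&code=61221514&cp=kd",
]


def query_params(url):
    """Return the query string of url as a {name: value} dict (first value only)."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}


for url in urls:
    params = query_params(url)
    # cp is None for addresses that do not carry the extra parameter.
    print(params["arcid"], params.get("cp"))
# 0924235144 None
# 0924235120 None
# 0016856942 kd
```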

Step 03 - Extract the article body by web crawling

This time, crawl only the article body.
- Test with a single record first.

> library(rvest)
> url = kmib_df$URL[1]
> news = read_html(url, encoding = "EUC-KR")
> news
{html_document}<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko">
[1] <head>\n<title>팬데믹의 비극… 무너지는 가정, 스러지는 아이들-국민일보</title>\n<meta http-equiv="Cont ...
[2] <body>\r\n<div id="wrap">\r\n\r\n<!-- header -->\r\n\r\n<script src="http://ww ...

Here, the article text sits under the div tag with class tx.
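
Pulling the text out of such a div can be sketched with Python's standard-library HTML parser. The HTML string below is a made-up miniature page, not the real article markup, and `DivTextExtractor` and `extract_div_text` are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect text found inside every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # nesting depth of <div> tags inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the target
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_div_text(html, target_class):
    """Return the text inside div.target_class, one chunk per line."""
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    return "\n".join(parser.chunks)


# Hypothetical sample page used only to exercise the parser.
sample = ('<html><body><div id="wrap"><div class="tx">First paragraph.<br>'
          'Second paragraph.</div><div class="ad">skip me</div></div></body></html>')
print(extract_div_text(sample, "tx"))
```

For a real page, the HTML would first have to be downloaded and decoded with the encoding noted above (EUC-KR) before being passed to `extract_div_text`.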

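Overview

This post walks through deploying a Heroku app.
Most importantly, the project must be connected to Git.
- GitHub : https://github.com/
- Git : https://git-scm.com/
- The installation steps for these are omitted.
The project to deploy is described in the following post.
- Reference : Python Sales Dashboard Using Dash and Plotly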

Create the Procfile

Create a Procfile in the project root directory.

web: gunicorn index:server

Here, index refers to the file name (index.py), and server refers to the variable defined in that file.
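
The structure of the Procfile entry can be made explicit by splitting it into its parts. The sketch below is only an illustration of the format in Python. `parse_procfile_line` is a hypothetical helper, not something Heroku or gunicorn provides, and it assumes the simple `type: program module:variable` form used in this post.

```python
def parse_procfile_line(line):
    """Split 'web: gunicorn index:server' into process type, program, module, variable."""
    # The first colon separates the process type from the command.
    process_type, command = (part.strip() for part in line.split(":", 1))
    program, target = command.split()
    # gunicorn's target has the form <module>:<variable>.
    module, variable = target.split(":")
    return {"process_type": process_type, "program": program,
            "module": module, "variable": variable}


info = parse_procfile_line("web: gunicorn index:server")
print(info)
# {'process_type': 'web', 'program': 'gunicorn', 'module': 'index', 'variable': 'server'}
```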

Edit the application file

Open index.py and add the following code.
- Add server = app.server.

app = dash.Dash(__name__, meta_tags=[{"name": "viewport",
                                      "content": "width=device-width"}])

server = app.server

Add the runtime file

Specify which Python version to run in runtime.txt.
- As before, create it in the project root directory.

python-3.8.7

Log in to Heroku and create the app

If you have not signed up for Heroku yet, do so now.

Overview

This post walks through building a dashboard with Sales data.
It assumes basic Python coding skills. Check the references if more detail is needed.
The project was carried out on Windows 10.

Chapter 1. Create the GitHub repo

I created a GitHub repo. (Repo name: python_dash_sales)
Clone it to the local machine with git clone.

$ git clone https://github.com/your_id/python_dash_sales.git

Chapter 2. Create the Python project

PyCharm will be the main editor.

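Overview

This post introduces BigQuery ML.
With BigQuery ML, you can create and run machine learning models.

Goals

Create a linear regression model with the CREATE MODEL statement in BigQuery ML
Evaluate the ML model with the ML.EVALUATE function
Make predictions with the ML model using the ML.PREDICT function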

Notes

The documents on BigQuery costs are as follows.
- BigQuery pricing: https://cloud.google.com/bigquery/pricing
- BigQuery ML pricing: https://cloud.google.com/bigquery-ml/pricing

Step 1: Create a dataset

Enter bqml_practice as the dataset ID.
Select US (United States) as the data location.
Leave everything else at the default.

Step 2: Create a model

About the data

First, a look at the data.
The original data comes from the following paper.
- https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081
The columns are described below.
- species: penguin species (STRING)
- island: island where the penguin lives (STRING)
- culmen_length_mm: culmen length in millimeters (FLOAT64)
- culmen_depth_mm: culmen depth in millimeters (FLOAT64)
- flipper_length_mm: flipper length in millimeters (FLOAT64)
- sex: penguin sex (STRING)

Run the model creation code

Run the CREATE MODEL statement to create the model.

#standardSQL
CREATE OR REPLACE MODEL `bqml_practice.penguins_model`
OPTIONS
  (model_type='linear_reg',
  input_label_cols=['body_mass_g']) AS
SELECT
  *
FROM
  `bigquery-public-data.ml_datasets.penguins`
WHERE
  body_mass_g IS NOT NULL

The execution details show that the Preprocess, Train, and Evaluate stages were run.
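
The goals listed earlier also mention ML.EVALUATE and ML.PREDICT. A minimal sketch of what those queries could look like, wrapped in Python, follows. It assumes the model and dataset names used above, and `run_query` additionally assumes that the google-cloud-bigquery package is installed and that credentials are already configured. The choice to evaluate and predict on the same public table is an illustrative assumption only.

```python
MODEL = "bqml_practice.penguins_model"
SOURCE = "bigquery-public-data.ml_datasets.penguins"

# Evaluate the trained model on rows that have a label.
EVALUATE_SQL = f"""
SELECT *
FROM ML.EVALUATE(MODEL `{MODEL}`,
  (SELECT * FROM `{SOURCE}` WHERE body_mass_g IS NOT NULL))
"""

# Predict body mass for the same rows (illustrative input only).
PREDICT_SQL = f"""
SELECT *
FROM ML.PREDICT(MODEL `{MODEL}`,
  (SELECT * FROM `{SOURCE}` WHERE body_mass_g IS NOT NULL))
"""


def run_query(sql):
    """Run sql on BigQuery and return the rows as a list of dicts."""
    # Imported lazily so the SQL strings can be inspected without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]
```

The same SQL can also be pasted directly into the BigQuery console. Running either query is billed like any other BigQuery query, so the pricing pages above still apply.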

Overview

Implement examples that connect to GCP BigQuery.
First, load data into BigQuery.
Then read BigQuery data from Google Colab.
Finally, read BigQuery data from Data Studio.

Introduction

Videos introducing BigQuery are easy to find on YouTube.
- Video reference: 데이터 웨어하우스 끝판왕 BigQuery 어디까지 알고 계신가요

Sign up for Google Cloud

What you need
- A Google account
- A credit or debit card (personally, I recommend a debit card with no balance)
Go to the Google Cloud site.
- Site: https://cloud.google.com/
To get a free server, click TRY IT FREE on that page.