삼성카드 대회 Track-2 - 포지셔닝 분석(2)

2020-09-04

머신러닝, Python, Machine Learning, Statistics

Page content

대회 소개

삼성카드 데이터분석 공모전이 시행되고 있다.
- 대회에 처음 참여하는 아시아경제-수강생들을 위해 일종의 가이드라인으로 제안하고자 한다.
본 포스트에서는 기본적인 내용만 전달하고자 함을 밝힌다.
- Track2 과정은 마케팅 전략 제안이 중요하다!

포지셔닝 분석 개요

마케팅에서 자주 보는 분석 방법중의 하나는 포지셔닝(Positioning) 기법이다.
포지셔닝 분석은 마케팅 통계분석 기법중의 하나로, 기업이나, 상품, 브랜드 같은 개체들의 포지셔닝을 수행하는 다차원 척도법(MDS: Multi-Dimensional Scaling)과 상응분석(Correspondence Analysis)이 있다.
위 두가지 분석 방법 중 무엇을 사용해야 할까?
- 만약 데이터셋이 주로 등간척도, 비율척도와 같이 구성되어 있다면 다차원 척도법
- 만약 데이터셋이 주로 명목척도, 서열척도와 같이 구성되어 있다면 상응분석
현재 삼성카드 대회의 주 데이터셋은 명목척도 및 서열척도로 구성되어 있기 때문에 상응분석으로 시작하면 된다.

상응분석

Correspondence Analysis는 범주형 변수(수준)들 간의 연관성을 분석한 후, 그 결과를 시각적 해석이 용이하도록 그래프화 하는 것임
삼성카드 대회 Track-2 - 포지셔닝 분석(1)

(1) 기본 개념

상응분석을 사용하려면 빈도교차표를 만들어야 한다.
요약하면, 상응분석은 범주형 변수의 빈도를 나타내고 있는 빈도교차표의 행과 열(명목변수의 범주 값들)을 그래프상의 자극점 형태로 표시하는 방법.
이 때, 단순 상응분석은 2개의 변수, 다중 상응분석은 3개 이상의 변수 활용한다.
이 때, 상응분석은 카이제곱 검정과 같이 범주형 변수간의 상호연관성을 바탕으로 진행된다.
- 따라서, 두 개의 범주형 변수가 서로 연관성을 가지고 있다는 전제하에서 진행된다.

(2) 데이터 불러오기

먼저 필요한 데이터를 불러온다.
필자는 구글 코랩에서 데이터를 불러오기 때문에

%config InlineBackend.figure_format = 'retina'
!sudo apt-get -qq -y install fonts-nanum

The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  fonts-nanum
0 upgraded, 1 newly installed, 0 to remove and 39 not upgraded.
Need to get 9,604 kB of archives.
After this operation, 29.5 MB of additional disk space will be used.
Selecting previously unselected package fonts-nanum.
(Reading database ... 144579 files and directories currently installed.)
Preparing to unpack .../fonts-nanum_20170925-1_all.deb ...
Unpacking fonts-nanum (20170925-1) ...
Setting up fonts-nanum (20170925-1) ...
Processing triggers for fontconfig (2.12.6-0ubuntu2) ...

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.font_manager as fm

/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

for font in fm.fontManager.ttflist:
    if 'Nanum' in font.name:
        print(font.name, font.fname)

NanumGothic /usr/share/fonts/truetype/nanum/NanumGothic.ttf
NanumSquareRound /usr/share/fonts/truetype/nanum/NanumSquareRoundR.ttf
NanumSquare /usr/share/fonts/truetype/nanum/NanumSquareR.ttf
NanumSquareRound /usr/share/fonts/truetype/nanum/NanumSquareRoundB.ttf
NanumBarunGothic /usr/share/fonts/truetype/nanum/NanumBarunGothicBold.ttf
NanumMyeongjo /usr/share/fonts/truetype/nanum/NanumMyeongjo.ttf
NanumMyeongjo /usr/share/fonts/truetype/nanum/NanumMyeongjoBold.ttf
NanumBarunGothic /usr/share/fonts/truetype/nanum/NanumBarunGothic.ttf
NanumSquare /usr/share/fonts/truetype/nanum/NanumSquareB.ttf
NanumGothic /usr/share/fonts/truetype/nanum/NanumGothicBold.ttf

fontpath = '/usr/share/fonts/truetype/nanum/NanumGothic.ttf'
font = fm.FontProperties(fname=fontpath, size=9)
plt.rc('font', family='NanumGothic') 
plt.rcParams["figure.figsize"] = (20, 10)
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.color'] = 'r'
plt.rcParams['axes.grid'] = True 
mpl.font_manager._rebuild()
fm._rebuild()

한글 텍스트가 잘 나오는지 확인해본다.

font = {'family' : 'NanumGothic',
        'weight' : 'bold',
        'size'   : 22}

plt.rc('font', **font)
plt.text(0.3, 0.3, '한글', size=100)

Text(0.3, 0.3, '한글')

png

# Mount Google Drive
from google.colab import drive # import drive from google colab

ROOT = "/content/drive"     # default location for the drive
print(ROOT)                 # print content of ROOT (Optional)
drive.mount(ROOT)           # we mount the google drive at /content/drive

/content/drive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

# import join used to join ROOT path and MY_GOOGLE_DRIVE_PATH
from os.path import join  

# path to your project on Google Drive
MY_GOOGLE_DRIVE_PATH = 'My Drive/Colab Notebooks/samsung/datasets'

PROJECT_PATH = join(ROOT, MY_GOOGLE_DRIVE_PATH)
print(PROJECT_PATH)

/content/drive/My Drive/Colab Notebooks/samsung/datasets

%cd "{PROJECT_PATH}"

/content/drive/My Drive/Colab Notebooks/samsung/datasets

%ls

'[기타] SCDC_공모전 테이블 설명.xlsx'  '[Track1_데이터4] variable_dtype.xlsx'
'[Track1_데이터1] mrc_info.csv'        '[Track2_데이터1] trend_w_demo.csv'
'[Track1_데이터2] samp_train.csv'      '[Track2_데이터2] 업종_예시.xlsx'
'[Track1_데이터3] samp_cst_feat.csv'

import pandas as pd

trend_w_demo = pd.read_csv('[Track2_데이터1] trend_w_demo.csv', encoding="EUC-KR")
trend_w_demo.head()

	YM	Category	성별구분	연령대	기혼스코어	유아자녀스코어	초등학생자녀스코어	중고생자녀스코어	대학생자녀스코어	전업주부스코어
0	202005	할인점	0	F	high	low	high	mid	low	low
1	202005	취미	0	B	high	low	mid	mid	low	low
2	202005	오픈마켓/소셜	1	D	mid	mid	mid	mid	low	mid
3	202005	뷰티	0	D	mid	mid	mid	mid	low	low
4	202005	오픈마켓/소셜	0	G	high	low	mid	mid	mid	low

위 데이터는 전형적으로 범주형으로 구성이 되어 있다.
위 변수에서 고려할 것이 있다.
- Category, 성별구분, 연령대는 명목형 변수인데 반해
- 기혼스코어:전업주부스코어의 경우는 확률이다.
각 스코어의 확률에 대한 정의는 다음과 같다.
- 기혼스코어 : 카드 이용 고객이 기혼일 확률
- 유아자녀스코어 : 카드 이용 고객이에게 유아자녀가 있을 확률
- 초등학생자녀스코어 : 카드 이용 고객이에게 초등학생 자녀가 있을 확률
- 중고생자녀스코어 : 카드 이용 고객이에게 중고생 자녀가 있을 확률
- 대학생자녀스코어 : 카드 이용 고객이에게 대학생 자녀가 있을 확률
- 전업주부스코어 : 카드 이용 고객이 전업주부일 확률
위 스코어의 데이터는 단순히 명목형이라고 보기는 어렵다. 즉, 순위(서열)형 척도/등간척도로 봐야 할 것이다.

(3) 데이터셋 필터링

우선, 코로나 이후의 상황을 보기 위한 것인 만큼 2019년 자료는 삭제한다.

new_df = trend_w_demo[trend_w_demo['YM'].isin([202004, 202005])]
trend_w_demo.shape, new_df.shape

((452038, 10), (210468, 10))

약 절반 가까이 데이터를 줄였다.

(4) 단순 상응분석의 개념적 분석 과정

본 포스트에서는 내용에 대해서는 디테일하게 다루지 않는다.
자세한 것은 Reference 교재를 확인한다.
첫째, 빈도교차표 형태의 데이터를 준비한다.
둘째, 기대 빈도교차표를 작성한다.
셋째, 카이제곱 행렬표를 작성한다.
넷째, 차원 좌표 생성한다.
다섯째, 마지막으로 좌표계 맵핑을 진행한다.

!pip install prince

Collecting prince
  Downloading https://files.pythonhosted.org/packages/51/f4/8de7003b86351a0e32e29ca2bbbbbf58e311b09f9286e83e638d437aee6d/prince-0.7.0-py3-none-any.whl
Requirement already satisfied: pandas>=1.0.3 in /usr/local/lib/python3.6/dist-packages (from prince) (1.0.5)
Requirement already satisfied: matplotlib>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from prince) (3.2.2)
Requirement already satisfied: scikit-learn>=0.22.1 in /usr/local/lib/python3.6/dist-packages (from prince) (0.22.2.post1)
Requirement already satisfied: scipy>=1.3.0 in /usr/local/lib/python3.6/dist-packages (from prince) (1.4.1)
Requirement already satisfied: numpy>=1.17.1 in /usr/local/lib/python3.6/dist-packages (from prince) (1.18.5)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=1.0.3->prince) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas>=1.0.3->prince) (2.8.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.2->prince) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.2->prince) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.2->prince) (2.4.7)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.22.1->prince) (0.16.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas>=1.0.3->prince) (1.15.0)
Installing collected packages: prince
Successfully installed prince-0.7.0

(5) 함수화

계속 반복될 것 같으니, 함수화를 진행하자.

# module 
import pandas as pd
import matplotlib.pyplot as plt
import prince

# 폰트 세팅
font = {'family' : 'NanumGothic',
        'weight' : 'bold',
        'size'   : 14}

plt.rc('font', **font)

def plot_mca(X = None, var_list = None):
    """다중 상응분석 그래프 작성

    Keyword arguments:
    X -- 명목형 판다스데이터프레임 (default None)
    var_list -- 입력 변수명 리스트로 작성
    """
    
    input_X = X[var_list]
    mca = prince.MCA(n_components=2).fit(input_X)
    
    # 시각화
    ax = mca.plot_coordinates(X = input_X, figsize=(10, 10), show_column_labels=True)
    ax.set_title("다중 상응분석", fontsize = 24)

(6) 다중 상응분석: 기혼스코어, 유아자녀스코어 기준

주어진 함수를 바탕으로 다중 상응분석을 시각화해본다.
이 때, 성별구분(현재 int)을 제외하고 나머지 변수들에 대해 다양하게 스코어를 입력해보자.
- 이 때, 순서는 [Category, name_of_score, ...]으로 입력하면 된다.

plot_mca(new_df, var_list = ['Category', '기혼스코어', '유아자녀스코어'])

/usr/local/lib/python3.6/dist-packages/matplotlib/backends/backend_agg.py:214: RuntimeWarning: Glyph 8722 missing from current font.
  font.set_text(s, 0.0, flags=flags)
/usr/local/lib/python3.6/dist-packages/matplotlib/backends/backend_agg.py:183: RuntimeWarning: Glyph 8722 missing from current font.
  font.set_text(s, 0, flags=flags)

png

특이한 것은 유아자녀스코어가 높을수록 종합몰에 관련성이 있었고, 기혼스코어가 낮을수록 면세점, 취미, 뷰티, 호텔숙박의 연관성이 있음을 확인했다.
유아자녀스코어_low & 기혼스코어_high는 방향성이 어느정도 일치하는 것을 확인할 수 있다.
- 위 두 그룹 주변에는 할인점, 전문몰, 디저트 등으로 배치가 되어 있다.
유아자녀스코어_mid & 기혼스코어_mid도 같은 방향으로 일치하고 있다.

결론

범주형 데이터를 다루는 것은 쉽지 않다.
그러나, 실제 마케팅에서는 이러한 범주형 데이터가 많은 것 또한 사실이다.
단순하게, 막대 그래프를 그리는 것보다, 각 데이터의 차이를 앎으로써, 보다 직관적으로 해석해줄 수 있는 이러한 분석은 활용하기에 매우 좋다.
기혼스코어를 통해서 조금 유의미한 결과를 도출할 수 있었다.
- 그러나, 그 외 나머지 스코어에서는 아직 이렇다할 변화를 감지하지는 못했다.
다중 상응분석을 통해서 보다 다변수간의 포지셔닝을 확인할 수 있었다.
이제 마지막으로, 성별을 확인해보자.
- 마케팅에서 결국 중요한 것은 성별이다.

Reference

김형수(2020). Step by Step 파이썬 비즈니스 통계분석. 서울: 프레딕스.