Python Statistics - Positioning Analysis (1)

2020-09-02

Machine Learning, Python, Statistics

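Positioning Analysis Overview

Positioning is one of the analysis methods seen most often in marketing.
Positioning analysis is a family of statistical techniques in marketing for positioning entities such as companies, products, and brands. The two main methods are multidimensional scaling (MDS) and correspondence analysis.
Which of the two should be used?
- If the dataset consists mainly of interval-scale or ratio-scale variables, use multidimensional scaling.
- If the dataset consists mainly of nominal-scale or ordinal-scale variables, use correspondence analysis.
The main dataset of the current Samsung Card competition consists of nominal and ordinal variables, so correspondence analysis is the place to start.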

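Correspondence Analysis

Correspondence analysis examines the association between the levels of categorical variables and then plots the result as a graph so that it is easy to interpret visually.

(1) Basic Concepts

Correspondence analysis starts from a frequency cross table.
In short, it takes the rows and columns of a frequency cross table (the category values of the nominal variables) and displays them as points on a graph.
Simple correspondence analysis uses two variables, while multiple correspondence analysis uses three or more.
Like the chi-square test, correspondence analysis is built on the association between categorical variables.
- It therefore proceeds on the premise that the two categorical variables are associated with each other.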
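
The premise above can be checked with a chi-square test of independence before running the analysis. The sketch below is only an illustration and assumes SciPy is available: the counts are copied by hand from the resort-by-traffic cross table printed later in this post, and the variable name `observed` is my own. Every expected count in this table is below 5, so the chi-square approximation is rough and the p-value should be read with caution.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Resort x traffic counts, copied from the cross table printed later in this post.
# Rows: 대명, 리솜, 무주, 용평, 한화; columns: Traffic-H, Traffic-L, Traffic-M.
observed = np.array([
    [4, 0, 6],
    [1, 5, 4],
    [0, 6, 4],
    [4, 2, 4],
    [2, 2, 6],
])

# chi2_contingency returns the statistic, the p-value, the degrees of freedom,
# and the expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
print(expected)
```

A small p-value would support the premise that resort and traffic rating are associated. With only 50 ratings, the map is better treated as an exploratory picture than as a confirmed result.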

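(2) Loading the Data

First, load the required data.
I load the data in Google Colab, so the cells below first install a Korean font and then mount Google Drive.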

%config InlineBackend.figure_format = 'retina'
!apt -qq -y install fonts-nanum

fonts-nanum is already the newest version (20170925-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.font_manager as fm

/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

for font in fm.fontManager.ttflist:
    if 'Nanum' in font.name:
        print(font.name, font.fname)

NanumMyeongjo /usr/share/fonts/truetype/nanum/NanumMyeongjoBold.ttf
NanumSquareRound /usr/share/fonts/truetype/nanum/NanumSquareRoundR.ttf
NanumSquare /usr/share/fonts/truetype/nanum/NanumSquareB.ttf
NanumGothic /usr/share/fonts/truetype/nanum/NanumGothicBold.ttf
NanumMyeongjo /usr/share/fonts/truetype/nanum/NanumMyeongjo.ttf
NanumGothic /usr/share/fonts/truetype/nanum/NanumGothic.ttf
NanumBarunGothic /usr/share/fonts/truetype/nanum/NanumBarunGothicBold.ttf
NanumSquareRound /usr/share/fonts/truetype/nanum/NanumSquareRoundB.ttf
NanumBarunGothic /usr/share/fonts/truetype/nanum/NanumBarunGothic.ttf
NanumSquare /usr/share/fonts/truetype/nanum/NanumSquareR.ttf

fontpath = '/usr/share/fonts/truetype/nanum/NanumGothic.ttf'
font = fm.FontProperties(fname=fontpath, size=9)
plt.rc('font', family='NanumGothic') 
plt.rcParams["figure.figsize"] = (20, 10)
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.color'] = 'r'
plt.rcParams['axes.grid'] = True 
fm.fontManager.addfont(fontpath)  # register the font file directly (matplotlib >= 3.2) instead of calling the private _rebuild()

font = {'family' : 'NanumGothic',
        'weight' : 'bold',
        'size'   : 22}

plt.rc('font', **font)
plt.text(0.3, 0.3, '한글', size=100)

Text(0.3, 0.3, '한글')

[Figure: test plot rendering the Korean text '한글']

# Mount Google Drive
from google.colab import drive # import drive from google colab

ROOT = "/content/drive"     # default location for the drive
print(ROOT)                 # print content of ROOT (Optional)
drive.mount(ROOT)           # we mount the google drive at /content/drive

/content/drive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

# import join used to join ROOT path and MY_GOOGLE_DRIVE_PATH
from os.path import join  

# path to your project on Google Drive
MY_GOOGLE_DRIVE_PATH = 'My Drive/Colab Notebooks/inflearn/Python/Kaggle_Edu/05_statistics/'

PROJECT_PATH = join(ROOT, MY_GOOGLE_DRIVE_PATH)
print(PROJECT_PATH)

/content/drive/My Drive/Colab Notebooks/inflearn/Python/Kaggle_Edu/05_statistics/

%cd "{PROJECT_PATH}"

/content/drive/My Drive/Colab Notebooks/inflearn/Python/Kaggle_Edu/05_statistics

%ls

Chapter_05_01_positioning_analysis.ipynb  python_stat/

data = pd.read_csv("python_stat/Correspondence.csv", sep=",", encoding="CP949")
data.head()

	id	resort	slope	traffic	lodging	etc
0	1	대명	Slope-H	Traffic-H	Lodging-H	Etc-H
1	2	대명	Slope-H	Traffic-M	Lodging-H	Etc-L
2	3	대명	Slope-L	Traffic-M	Lodging-M	Etc-M
3	4	대명	Slope-L	Traffic-H	Lodging-M	Etc-M
4	5	대명	Slope-M	Traffic-M	Lodging-L	Etc-M

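Variable summary
- id: unique customer number (nominal scale)
- resort: name of the rated resort (nominal scale)
- slope: slope difficulty (nominal scale)
- traffic: traffic convenience (nominal scale)
- lodging: lodging convenience (nominal scale)
- etc: convenience of other facilities (nominal scale)
Difficulty and convenience are all rated on three levels: high (H), medium (M), and low (L).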
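
Because the three levels have a natural order, it can help to list the cross-table columns as low, medium, high instead of the alphabetical H, L, M that `pd.crosstab` produces by default. The sketch below is a small illustration, not part of the original notebook: `sample` reuses only the five rows shown by `data.head()` above, and `levels` is a name I introduce. Column order does not change the correspondence analysis itself; it only makes the table easier to read.

```python
import pandas as pd

# The five rows shown by data.head() above (resort and traffic columns only).
sample = pd.DataFrame({
    "resort": ["대명"] * 5,
    "traffic": ["Traffic-H", "Traffic-M", "Traffic-M", "Traffic-H", "Traffic-M"],
})

# Desired column order: low, medium, high.
levels = ["Traffic-L", "Traffic-M", "Traffic-H"]

# Build the cross table, then reorder the columns.
# fill_value=0 adds any level that does not occur in the sample.
table = pd.crosstab(sample["resort"], sample["traffic"])
table = table.reindex(columns=levels, fill_value=0)

print(table)
```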

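(3) Conceptual Steps of Simple Correspondence Analysis

This post does not cover the theory in detail.
See the textbook listed under Reference for a full treatment.
First, prepare the data as a frequency cross table.
Second, build the table of expected frequencies.
Third, build the chi-square matrix.
Fourth, generate the dimension coordinates.
Fifth and last, map the coordinates onto a plot.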
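
The five steps can be sketched with NumPy alone. This is a minimal illustration of the standard SVD-based formulation and is not taken from the reference textbook or from prince's source code. The counts are the resort-by-traffic table used later in this post, and all variable names are my own. The signs of the axes may be flipped relative to what a library produces.

```python
import numpy as np

# Step 1: frequency cross table (resort x traffic counts from this post).
# Rows: 대명, 리솜, 무주, 용평, 한화; columns: Traffic-H, Traffic-L, Traffic-M.
observed = np.array([
    [4, 0, 6],
    [1, 5, 4],
    [0, 6, 4],
    [4, 2, 4],
    [2, 2, 6],
], dtype=float)

n = observed.sum()
P = observed / n          # relative frequencies
r = P.sum(axis=1)         # row masses
c = P.sum(axis=0)         # column masses

# Step 2: expected relative frequencies under independence.
E = np.outer(r, c)

# Step 3: standardized residuals; n times the sum of their squares
# is the chi-square statistic.
S = (P - E) / np.sqrt(E)
chi2 = n * (S ** 2).sum()

# Step 4: dimension coordinates from the singular value decomposition.
U, sing, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sing) / np.sqrt(r)[:, None]      # principal row coordinates
col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]   # principal column coordinates
inertia = sing ** 2                                # inertia per dimension

# Step 5: keep the first two dimensions for the 2D map.
row_xy = row_coords[:, :2]
col_xy = col_coords[:, :2]

print("chi-square:", round(chi2, 3))
print("share of inertia per dimension:", np.round(inertia / inertia.sum(), 3))
print(row_xy)
print(col_xy)
```

A 5 x 3 table has at most two non-trivial dimensions, so the two-dimensional map loses no information in this case.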

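(3) Simple Correspondence Analysis: Traffic Convenience

First, let's look at the competitive relationships, or positioning, of the domestic resorts in terms of traffic convenience.
- The crosstab function builds the frequency cross table.
The prince module is used for the analysis itself.
- prince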

traffic = pd.crosstab(data.resort, data.traffic, margins=False)
print(traffic)

traffic  Traffic-H  Traffic-L  Traffic-M
resort                                  
대명               4          0          6
리솜               1          5          4
무주               0          6          4
용평               4          2          4
한화               2          2          6

For the simple correspondence analysis, use CA() from prince.

!pip install prince

Requirement already satisfied: prince in /usr/local/lib/python3.6/dist-packages (0.7.0)
Requirement already satisfied: pandas>=1.0.3 in /usr/local/lib/python3.6/dist-packages (from prince) (1.0.5)
Requirement already satisfied: scipy>=1.3.0 in /usr/local/lib/python3.6/dist-packages (from prince) (1.4.1)
Requirement already satisfied: matplotlib>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from prince) (3.2.2)
Requirement already satisfied: scikit-learn>=0.22.1 in /usr/local/lib/python3.6/dist-packages (from prince) (0.22.2.post1)
Requirement already satisfied: numpy>=1.17.1 in /usr/local/lib/python3.6/dist-packages (from prince) (1.18.5)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas>=1.0.3->prince) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=1.0.3->prince) (2018.9)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.2->prince) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.2->prince) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.2->prince) (0.10.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.22.1->prince) (0.16.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas>=1.0.3->prince) (1.15.0)

import prince

ca = prince.CA(n_components=2).fit(traffic)
print(ca)

CA(benzecri=False, check_input=True, copy=True, engine='auto', n_components=2,
   n_iter=10, random_state=None)

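The fitted model holds the dimension coordinates of the row variable resort and the column variable traffic from the frequency cross table.
Now visualize them.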

# Font settings
font = {'family' : 'NanumGothic',
        'weight' : 'bold',
        'size'   : 22}

plt.rc('font', **font)

# Visualization
ax = ca.plot_coordinates(X = traffic, figsize=(20, 10))
ax.set_title("상응분석", fontsize = 24)

Text(0.5, 1.0, '상응분석')



/usr/local/lib/python3.6/dist-packages/matplotlib/backends/backend_agg.py:214: RuntimeWarning: Glyph 8722 missing from current font.
  font.set_text(s, 0.0, flags=flags)
/usr/local/lib/python3.6/dist-packages/matplotlib/backends/backend_agg.py:183: RuntimeWarning: Glyph 8722 missing from current font.
  font.set_text(s, 0, flags=flags)

[Figure: correspondence analysis map of the resort and traffic categories]

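The map can be read intuitively: categories that lie close together are similar or strongly associated, while categories that lie far apart are different or weakly associated.
According to the correspondence analysis, the resort with high traffic convenience was 대명, while 리솜 and 무주 were among the resorts with poor traffic convenience.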
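
To back up the visual reading with numbers, the row profiles (each resort's share of H, L, and M ratings) and the chi-square distances between them can be computed directly. The sketch below is an illustration under my own naming (`row_profiles`, `distances`); the counts are copied from the cross table printed above. In that table, 대명 and 용평 both received 4 high ratings out of 10, so the two can also be compared through these profiles.

```python
import numpy as np
import pandas as pd

# Resort x traffic counts, copied from the cross table printed above.
traffic = pd.DataFrame(
    [[4, 0, 6], [1, 5, 4], [0, 6, 4], [4, 2, 4], [2, 2, 6]],
    index=["대명", "리솜", "무주", "용평", "한화"],
    columns=["Traffic-H", "Traffic-L", "Traffic-M"],
)

# Row profiles: each row divided by its total, so every row sums to 1.
row_profiles = traffic.div(traffic.sum(axis=1), axis=0)

# Column masses: the share of each rating level across all resorts.
col_mass = traffic.sum(axis=0) / traffic.to_numpy().sum()

# Chi-square distance between every pair of row profiles:
# squared differences are weighted by the inverse of the column masses.
profiles = row_profiles.to_numpy()
diff = profiles[:, None, :] - profiles[None, :, :]
dist = np.sqrt((diff ** 2 / col_mass.to_numpy()).sum(axis=2))
distances = pd.DataFrame(dist, index=traffic.index, columns=traffic.index)

print(row_profiles)
print(distances.round(3))
```

Small distances correspond to resorts that sit close together on the map, and large distances to resorts that sit far apart.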

(4) Simple Correspondence Analysis: Lodging Convenience

This time, run the simple correspondence analysis on lodging.

lodgings = pd.crosstab(data.resort, data.lodging, margins=False)
print(lodgings)

lodging  Lodging-H  Lodging-L  Lodging-M
resort                                  
대명               5          1          4
리솜               7          1          2
무주               0          5          5
용평               2          5          3
한화               3          3          4

ca = prince.CA(n_components=2).fit(lodgings)

# Visualization
ax = ca.plot_coordinates(X = lodgings, figsize=(20, 10))
ax.set_title("상응분석", fontsize = 24)

Text(0.5, 1.0, '상응분석')



/usr/local/lib/python3.6/dist-packages/matplotlib/backends/backend_agg.py:214: RuntimeWarning: Glyph 8722 missing from current font.
  font.set_text(s, 0.0, flags=flags)
/usr/local/lib/python3.6/dist-packages/matplotlib/backends/backend_agg.py:183: RuntimeWarning: Glyph 8722 missing from current font.
  font.set_text(s, 0, flags=flags)

[Figure: correspondence analysis map of the resort and lodging categories]

In the result above, 리솜 stands out as the resort with high lodging convenience.
용평 appears as a resort with relatively low lodging convenience.
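
One way to cross-check this reading is to compare each resort's row profile with the average profile over all resorts. The sketch below is an illustration only: the counts are copied from the lodging cross table printed above, and the names `average_profile` and `deviation` are my own. In that table, 무주 and 용평 have the same share of low ratings (5 of 10), so both are worth inspecting on the map.

```python
import pandas as pd

# Resort x lodging counts, copied from the cross table printed above.
lodgings = pd.DataFrame(
    [[5, 1, 4], [7, 1, 2], [0, 5, 5], [2, 5, 3], [3, 3, 4]],
    index=["대명", "리솜", "무주", "용평", "한화"],
    columns=["Lodging-H", "Lodging-L", "Lodging-M"],
)

# Row profiles: share of H, L, and M ratings within each resort.
row_profiles = lodgings.div(lodgings.sum(axis=1), axis=0)

# Average profile: share of each rating level across all resorts.
average_profile = lodgings.sum(axis=0) / lodgings.to_numpy().sum()

# Positive values mean a resort received that rating more often than average.
deviation = row_profiles - average_profile

print(deviation.round(2))
print("Largest excess of high ratings:", deviation["Lodging-H"].idxmax())
```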

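Conclusion

Working with categorical data is not easy.
Even so, categorical data is very common in real-world marketing.
Compared with simply drawing bar charts, this kind of analysis brings out the differences between categories and supports a more intuitive interpretation, which makes it very useful in practice.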

Reference

김형수(2020). Step by Step 파이썬 비즈니스 통계분석. 서울: 프레딕스.