Recommender Systems: Overview, Theory, and Baseline Code

2020-06-21

Python Project, Recommendation, Recommender System, Python


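I. Overview

Most customer-facing platform services have adopted personalized recommendations for individual customers.
- E-commerce companies, YouTube, Apple Music, and so on
Among the many ML techniques, recommender systems are one of the most closely aligned with business goals.
The real appeal of a recommender system is that it is designed to surface tastes users are not aware of themselves and to lead them to repeat purchases.
Who needs one?
- Every platform service
- Reason 1: A platform needs many sellers and many consumers. The problem is that as categories and menus grow more complex, consumers find it harder to choose a product.
- Reason 2: When satisfaction drops, customers are likely to leave the platform, which translates directly into lower revenue.
- Every platform service therefore wants a recommendation feature built in.
The goal of this post is to implement a recommender system step by step using movie data.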

II. Types and History of Recommender Systems

This section covers the types of recommender systems and a brief history.

(1) Types

There are three broad types.
- Demographic Filtering
- Content-Based Filtering
- Collaborative Filtering
  - Nearest Neighbor
  - Latent Factor

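(2) History

Early period: content-based filtering or nearest-neighbor collaborative filtering was mainly used.
Middle period: latent factor collaborative filtering based on matrix factorization became well known after it won the Netflix recommender system competition.
Recently: hybrid approaches that combine content-based and collaborative methods to strengthen personalization are increasingly common.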

III. Demographic Filtering

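The most basic and simple type of recommender system.
It recommends the same movies to each user on the basis of demographics.
In practice, though, movie recommendation of this kind selects mainly popular, mainstream movies (for example, the top 250) and serves them to every user. The assumption is that popular movies are more likely to appeal to general audiences who have not seen them yet.
Requirements
- A rating score for each movie
- Movies sorted by that score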

(1) Data Collection

The data comes from Kaggle.
- https://www.kaggle.com/rounakbanik/the-movies-dataset/data

(2) Data Description

The dataset contains several files; the descriptions below are taken verbatim from the Kaggle page.
- movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.
- keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.
- credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.
- links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.
- links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.
- ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
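
Several of these columns are stored as stringified objects, so they need to be parsed before use. Below is a minimal sketch, assuming the strings use Python-literal syntax (single quotes), as the genres values in movies_metadata.csv appear to. The sample string and the helper name extract_names are hypothetical, made up for this illustration.

```python
from ast import literal_eval

def extract_names(raw):
    """Parse a stringified list of dicts and return the 'name' fields."""
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []  # missing or unparseable value
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed
            if isinstance(item, dict) and "name" in item]

# Made-up sample in the same shape as the dataset's stringified columns
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(extract_names(sample))
```

ast.literal_eval only evaluates literals, so it is a safer choice than eval for text read from a file. If a column turns out to hold strict JSON (double quotes), json.loads would be the natural substitute.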

(3) Connecting to Google Drive

Mount Google Drive and load the data with pandas.
For how to connect Google Drive, see Colab + Drive + Github Workflow.

from google.colab import drive  # Colab helper for mounting Google Drive
from os.path import join

ROOT = "/content/drive"     # default mount point for Drive
print(ROOT)                 # print content of ROOT (Optional)
drive.mount(ROOT)           # mount Drive at the default path

MY_GOOGLE_DRIVE_PATH = 'My Drive/Colab Notebooks/your/path' # project path inside Drive
PROJECT_PATH = join(ROOT, MY_GOOGLE_DRIVE_PATH) # full project path
print(PROJECT_PATH)

/content/drive

%cd "{PROJECT_PATH}"

!ls

data  source

import pandas as pd
import pandas_profiling
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns

from IPython.core.display import display, HTML
from pandas_profiling import ProfileReport

import pandas as pd
metadata = pd.read_csv('data/movies_metadata.csv', low_memory=False)
metadata.head(3)

	adult	belongs_to_collection	budget	genres	homepage	id	imdb_id	original_language	original_title	overview	popularity	poster_path	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	video	vote_average	vote_count
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[{'id': 16, 'name': 'Animation'}, {'id': 35, '...	http://toystory.disney.com/toy-story	862	tt0114709	en	Toy Story	Led by Woody, Andy's toys live happily in his ...	21.946943	/rhIRbceoE9lR4veEXuwCC2wARtG.jpg	[{'name': 'Pixar Animation Studios', 'id': 3}]	[{'iso_3166_1': 'US', 'name': 'United States o...	1995-10-30	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0
1	False	NaN	65000000	[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...	NaN	8844	tt0113497	en	Jumanji	When siblings Judy and Peter discover an encha...	17.015539	/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg	[{'name': 'TriStar Pictures', 'id': 559}, {'na...	[{'iso_3166_1': 'US', 'name': 'United States o...	1995-12-15	262797249.0	104.0	[{'iso_639_1': 'en', 'name': 'English'}, {'iso...	Released	Roll the dice and unleash the excitement!	Jumanji	False	6.9	2413.0
2	False	{'id': 119050, 'name': 'Grumpy Old Men Collect...	0	[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...	NaN	15602	tt0113228	en	Grumpier Old Men	A family wedding reignites the ancient feud be...	11.7129	/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg	[{'name': 'Warner Bros.', 'id': 6194}, {'name'...	[{'iso_3166_1': 'US', 'name': 'United States o...	1995-12-22	0.0	101.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	Still Yelling. Still Fighting. Still Ready for...	Grumpier Old Men	False	6.5	92.0

(4) Implementing the Weighted Rating Function

Two columns deserve close attention here: vote_average and vote_count.
Which is the better movie: one rated 8.7 by 3 voters, or one rated 8.0 by 100 voters?
To account for this, we build a weighted rating formula (wr: Weighted Rating).

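$$ wr = \left (\frac{v}{v+m} \times R \right ) + \left (\frac{m}{v+m} \times C \right ) $$

$v$ is the number of votes the movie received
$m$ is the minimum number of votes a movie must have to be considered
$R$ is the movie's average rating
$C$ is the mean rating across all movies
We build a function based on this formula.
- Take the mean of vote_average over all movies and assign it to the variable C.
- Apply the quantile function to vote_count and take the value at 0.90 (the 90th percentile) as the minimum.
- In other words, m is a kind of cut-off point.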
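
Writing $v$ for the vote count and $m$ for the minimum-votes cut-off (the names used in the code), the weighted rating can be regrouped to show that it pulls each movie's rating toward the global mean. This is only an algebraic rearrangement of the formula, offered as a reading aid:

$$ wr = \frac{v}{v+m} R + \frac{m}{v+m} C = C + \frac{v}{v+m} \left( R - C \right) $$

With $v = 0$ the score equals $C$; with $v = m$ it is the midpoint $(R + C)/2$; and as $v$ grows large it approaches $R$. A movie with few votes is therefore scored close to the average, while a heavily voted movie keeps nearly its own rating.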

C = metadata['vote_average'].mean()
print(C)

5.618207215133889

m = metadata['vote_count'].quantile(0.90)
print(m)

160.0
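
To make the cut-off concrete, here is a small sketch of what quantile(0.90) returns on a toy series. The vote counts are made up for illustration and are not taken from the dataset; the names toy_votes, toy_m, and kept are likewise hypothetical.

```python
import pandas as pd

# Hypothetical vote counts for ten movies (illustrative values only)
toy_votes = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# 90th percentile; pandas interpolates linearly between neighbours by default
toy_m = toy_votes.quantile(0.90)

# Movies at or above the cut-off are the ones that would be kept
kept = toy_votes[toy_votes >= toy_m]
print(toy_m, kept.tolist())
```

On the real data the same call gives 160.0, so a movie needs at least 160 votes to stay in the ranking.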

def weighted_rating(x, m=m, C=C):
    '''
    Returns Weighted Rating

    Parameters
    ----------
    x : Series
      one movie row (from a DataFrame filtered to vote_count >= m)

    m : float
      minimum votes

    C : float
      mean of vote average column

    Returns
    ----------
    The weighted rating computed with the IMDB formula
    '''
    v = x['vote_count']
    R = x['vote_average']

    return (v/(v+m) * R) + (m/(m+v) * C)
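
As a quick check of the formula, the sketch below applies it to the two hypothetical movies from the question above (3 votes at 8.7 versus 100 votes at 8.0). It assumes m = 160 and C ≈ 5.618, the values this dataset produced. The helper wr is written only for this example and is not the weighted_rating function itself.

```python
def wr(v, R, m=160.0, C=5.618207215133889):
    """Weighted rating for a single movie with v votes and mean rating R."""
    return (v / (v + m)) * R + (m / (v + m)) * C

few_votes = wr(v=3, R=8.7)     # 3 voters, average 8.7
many_votes = wr(v=100, R=8.0)  # 100 voters, average 8.0

print(round(few_votes, 3), round(many_votes, 3))
```

The 3-vote movie scores about 5.67 and the 100-vote movie about 6.53, so the movie with more votes ranks higher despite its lower raw average. Both have fewer than 160 votes and would be dropped by the cut-off in the next step, so this comparison only illustrates how the formula behaves.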

We now store the filtered data in movies_df, dropping every movie whose vote_count is below m.
This leaves only about 10% of the data (4,555 of 45,466 rows).

movies_df = metadata.loc[metadata['vote_count'] >= m].copy()
movies_df.shape

(4555, 24)

metadata.shape

(45466, 24)

(5) Getting the Recommended Movies

Use the weighted rating function to create a new score feature.
Then sort by score in descending order and take the top 10 movies; those are the recommendations.

movies_df['score'] = movies_df.apply(weighted_rating, axis=1)
movies_df = movies_df.sort_values('score', ascending=False)
movies_df[['title', 'vote_count', 'vote_average', 'score']].head(10)

                             title  vote_count  vote_average     score
314       The Shawshank Redemption      8358.0           8.5  8.445869
834                  The Godfather      6024.0           8.5  8.425439
10309  Dilwale Dulhania Le Jayenge       661.0           9.1  8.421453
12481              The Dark Knight     12269.0           8.3  8.265477
2843                    Fight Club      9678.0           8.3  8.256385
292                   Pulp Fiction      8670.0           8.3  8.251406
522               Schindler's List      4436.0           8.3  8.206639
23673                     Whiplash      4376.0           8.3  8.205404
5481                 Spirited Away      3968.0           8.3  8.196055
2211             Life Is Beautiful      3643.0           8.3  8.187171

This approach is very easy to implement, but it does not tailor recommendations to each individual's taste.
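
The filter, score, and sort steps can also be written without apply, since the formula is plain column arithmetic. The sketch below runs the whole pipeline on a small made-up DataFrame. The titles, the numbers, and the function name top_movies are hypothetical, and the 0.5 quantile is used only so that the toy data keeps more than one row.

```python
import pandas as pd

def top_movies(df, quantile=0.90, n=10):
    """Rank movies by weighted rating using vectorized column arithmetic."""
    C = df["vote_average"].mean()            # mean rating across all movies
    m = df["vote_count"].quantile(quantile)  # minimum-votes cut-off
    kept = df.loc[df["vote_count"] >= m].copy()
    v, R = kept["vote_count"], kept["vote_average"]
    kept["score"] = (v / (v + m)) * R + (m / (v + m)) * C
    return kept.sort_values("score", ascending=False).head(n)

# Made-up data for illustration only
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_count": [10, 200, 1000, 50],
    "vote_average": [9.0, 8.0, 7.5, 6.0],
})
ranked = top_movies(toy, quantile=0.5)
print(ranked[["title", "vote_count", "vote_average", "score"]])
```

Movie A has the highest raw average in the toy data but is filtered out for having too few votes, which is the behaviour the cut-off is meant to produce.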

IV. Reference

Ibtesama. (2020, May 16). Getting Started with a Movie Recommendation System. Retrieved June 21, 2020, from https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system