[Recommendation System] TMDB 5000 데이터를 이용한 콘텐츠 기반 필터링 실습

4 minute read

TMDB 5000 데이터를 이용한 콘텐츠 기반 필터링 실습

TMDB 5000은 캐글의 영화 데이터 세트입니다.

# 필요 라이브러리 import
import pandas as pd
import numpy as np
import warnings; warnings.filterwarnings('ignore')

1. 데이터 전처리

from ast import literal_eval

movies =pd.read_csv('assets/tmdb_5000_movies.csv')
movies_df = movies[['id','title', 'genres', 'vote_average', 'vote_count',
                    'popularity', 'keywords', 'overview']]
pd.set_option('max_colwidth', 100)
movies_df[['genres','keywords']][:1]

movies_df['genres'] = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)

movies_df['genres'] = movies_df['genres'].apply(lambda x : [ y['name'] for y in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x : [ y['name'] for y in x])

movies_df[['genres', 'keywords']].head(5)

	genres	keywords
0	[Action, Adventure, Fantasy, Science Fiction]	[culture clash, future, space war, space colony, society, space travel, futuristic, romance, spa...
1	[Adventure, Fantasy, Action]	[ocean, drug abuse, exotic island, east india trading company, love of one's life, traitor, ship...
2	[Action, Adventure, Crime]	[spy, based on novel, secret agent, sequel, mi6, british secret service, united kingdom]
3	[Action, Crime, Drama, Thriller]	[dc comics, crime fighter, terrorist, secret identity, burglar, hostage drama, time bomb, gotham...
4	[Action, Adventure, Science Fiction]	[based on novel, mars, medallion, space travel, princess, alien, steampunk, martian, escape, edg...

# CountVectorizer를 적용하기 위해 공백문자로 word 단위가 구분되는 문자열로 변환 
from sklearn.feature_extraction.text import CountVectorizer

# 딕셔너리 형태를 리스트로 변환한 genres_literal 칼럼 생성
movies_df['genres_literal'] = movies_df['genres'].apply(lambda x : (' ').join(x))
count_vect = CountVectorizer(min_df=0, ngram_range=(1,2))
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])

from sklearn.metrics.pairwise import cosine_similarity

# 코사인 유사도
genre_sim = cosine_similarity(genre_mat, genre_mat)
genre_sim_sorted_ind = genre_sim.argsort()[:, ::-1]
# 장르 유사도 리스트
print(genre_sim_sorted_ind[:1])

[[   0 3494  813 ... 3038 3037 2401]]

2. 코사인 유사도가 높은 영화 검색

def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    
    # 인자로 입력된 movies_df DataFrame에서 'title' 컬럼이 입력된 title_name 값인 DataFrame추출
    title_movie = df[df['title'] == title_name]
    
    # title_named을 가진 DataFrame의 index 객체를 ndarray로 반환하고 
    # sorted_ind 인자로 입력된 genre_sim_sorted_ind 객체에서 유사도 순으로 top_n 개의 index 추출
    title_index = title_movie.index.values
    similar_indexes = sorted_ind[title_index, :(top_n)]
    
    # 추출된 top_n index들 출력. top_n index는 2차원 데이터 임. 
    #dataframe에서 index로 사용하기 위해서 1차원 array로 변경
    print(similar_indexes)
    similar_indexes = similar_indexes.reshape(-1)
    
    return df.iloc[similar_indexes]

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather',10)

similar_movies

	id	title	genres	vote_average	vote_count	popularity	keywords	overview	genres_literal	weighted_vote
2731	240	The Godfather: Part II	[Drama, Crime]	8.3	3338	105.792936	[italo-american, cuba, vororte, melancholy, praise, revenge, mafia, lawyer, blood, corrupt polit...	In the continuing saga of the Corleone crime family, a young Vito Corleone grows up in Sicily an...	Drama Crime	8.079586
1847	769	GoodFellas	[Drama, Crime]	8.2	3128	63.654244	[prison, based on novel, florida, 1970s, mass murder, irish-american, drug traffic, biography, b...	The true story of Henry Hill, a half-Irish, half-Sicilian Brooklyn kid who is adopted by neighbo...	Drama Crime	7.976937
3866	598	City of God	[Drama, Crime]	8.1	1814	44.356711	[male nudity, street gang, brazilian, photographer, 1970s, puberty, ghetto, gang war, coming of ...	Cidade de Deus is a shantytown that started during the 1960s and became one of Rio de Janeiro’s ...	Drama Crime	7.759693
1663	311	Once Upon a Time in America	[Drama, Crime]	8.2	1069	49.336397	[life and death, corruption, street gang, rape, sadistic, lovesickness, sexual abuse, money laun...	A former Prohibition-era Jewish gangster returns to the Lower East Side of Manhattan over thirty...	Drama Crime	7.657811
883	640	Catch Me If You Can	[Drama, Crime]	7.7	3795	73.944049	[con man, biography, fbi agent, overhead camera shot, attempted jailbreak, engagement party, mis...	A true story about Frank Abagnale Jr. who, before his 19th birthday, successfully conned million...	Drama Crime	7.557097
281	4982	American Gangster	[Drama, Crime]	7.4	1502	42.361215	[underdog, black people, drug traffic, drug smuggle, society, ambition, rise and fall, cop, drug...	Following the death of his employer and mentor, Bumpy Johnson, Frank Lucas establishes himself a...	Drama Crime	7.141396
4041	11798	This Is England	[Drama, Crime]	7.4	363	8.395624	[holiday, skinhead, england, vandalism, independent film, gang, racism, summer, youth, violence,...	A story about a troubled boy growing up in England, set in 1983. He comes across a few skinheads...	Drama Crime	6.739664
1149	168672	American Hustle	[Drama, Crime]	6.8	2807	49.664128	[con artist, scam, mobster, fbi agent]	A con man, Irving Rosenfeld, along with his seductive partner Sydney Prosser, is forced to work ...	Drama Crime	6.717525
1243	203	Mean Streets	[Drama, Crime]	7.2	345	17.002096	[epilepsy, protection money, secret love, money, redemption]	A small-time hood must choose from among love, friendship and the chance to rise within the mob.	Drama Crime	6.626569
2839	10220	Rounders	[Drama, Crime]	6.9	439	18.422008	[gambling, law, compulsive gambling, roulette, gain]	A young man is a reformed gambler who must return to playing big stakes poker to help a friend p...	Drama Crime	6.530427

# 유사도를 나타내는 특징 컬럼을 데이터프레임에 추가

movies_df[['title','vote_average','vote_count']].sort_values('vote_average', ascending=False)[:10]
C = movies_df['vote_average'].mean()
m = movies_df['vote_count'].quantile(0.6)
percentile = 0.6
m = movies_df['vote_count'].quantile(percentile)
C = movies_df['vote_average'].mean()

3. 최종 유사도 계산

def weighted_vote_average(record):
    v = record['vote_count']
    R = record['vote_average']
    
    return ( (v/(v+m)) * R ) + ( (m/(m+v)) * C )

# 유사도 가중치를 나타내는 특징 컬럼을 데이터프레임에 추가

movies_df['weighted_vote'] = movies_df.apply(weighted_vote_average, axis=1) 
movies_df[['title','vote_average','weighted_vote','vote_count']].sort_values('weighted_vote',
                                                                          ascending=False)[:10]

	title	vote_average	weighted_vote	vote_count
1881	The Shawshank Redemption	8.5	8.396052	8205
3337	The Godfather	8.4	8.263591	5893
662	Fight Club	8.3	8.216455	9413
3232	Pulp Fiction	8.3	8.207102	8428
65	The Dark Knight	8.2	8.136930	12002
1818	Schindler's List	8.3	8.126069	4329
3865	Whiplash	8.3	8.123248	4254
809	Forrest Gump	8.2	8.105954	7927
2294	Spirited Away	8.3	8.105867	3840
2731	The Godfather: Part II	8.3	8.079586	3338

4. 장르 유사성이 높은 영화 추천

def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie = df[df['title'] == title_name]
    title_index = title_movie.index.values
    
    # top_n의 2배에 해당하는 쟝르 유사성이 높은 index 추출 
    similar_indexes = sorted_ind[title_index, :(top_n*2)]
    similar_indexes = similar_indexes.reshape(-1)
    # 기준 영화 index는 제외
    similar_indexes = similar_indexes[similar_indexes != title_index]
    
    # top_n의 2배에 해당하는 후보군에서 weighted_vote 높은 순으로 top_n 만큼 추출 
    return df.iloc[similar_indexes].sort_values('weighted_vote', ascending=False)[:top_n]

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather',10)
similar_movies[['title', 'vote_average', 'weighted_vote']]

	title	vote_average	weighted_vote
2731	The Godfather: Part II	8.3	8.079586
1847	GoodFellas	8.2	7.976937
3866	City of God	8.1	7.759693
1663	Once Upon a Time in America	8.2	7.657811
883	Catch Me If You Can	7.7	7.557097
281	American Gangster	7.4	7.141396
4041	This Is England	7.4	6.739664
1149	American Hustle	6.8	6.717525
1243	Mean Streets	7.2	6.626569
2839	Rounders	6.9	6.530427

Share on

Twitter Facebook LinkedIn

wowo0709

[Recommendation System] TMDB 5000 데이터를 이용한 콘텐츠 기반 필터링 실습

TMDB 5000 데이터를 이용한 콘텐츠 기반 필터링 실습

1. 데이터 전처리

2. 코사인 유사도가 높은 영화 검색

3. 최종 유사도 계산

4. 장르 유사성이 높은 영화 추천

Share on

Leave a comment

You may also enjoy

[Python] Effective Python CH 2. 리스트와 딕셔너리 - 1

[Python] Effective Python CH 1. 파이썬답게 생각하기 - 2

[Python] Effective Python CH 1. 파이썬답게 생각하기 - 1

[Python] Effective Python 전체 목차