Using the Pandas itertuples Function

2020-03-22

Data Transformation, Python, Data Preprocessing


I. Iterrows

This post extends the concept of iterrows(). Before working through it, I recommend reading the earlier post on using the Pandas iterrows function.

II. The Concept of itertuples

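itertuples() is generally faster than iterrows(), because it returns each row as a lightweight namedtuple instead of building a Series for every row.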
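To make the difference concrete, here is a minimal sketch that inspects what each method yields for a single row. The two-row DataFrame and its values are hypothetical and exist only for illustration; they are not taken from the dataset used in this post.

```python
import pandas as pd

# Hypothetical two-row frame; the values are illustrative only.
df = pd.DataFrame({"Team": ["BOS", "NYY"], "RS": [734, 804], "OBP": [0.32, 0.34]})

# iterrows() yields (index, Series) pairs, so a Series is built for every row.
idx, series_row = next(df.iterrows())
print(type(series_row).__name__)   # Series

# itertuples() yields one namedtuple per row; fields are read as attributes.
tuple_row = next(df.itertuples())
print(type(tuple_row).__name__)    # Pandas
print(tuple_row.Team, tuple_row.RS)
```

Building a Series per row is the main extra cost of iterrows(), which is why the gap tends to grow with the number of rows.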

import pandas as pd
import io
import requests
import pprint

url = 'https://raw.githubusercontent.com/chloevan/datasets/master/sports/baseball_stats.csv'
# Download the CSV and keep the response separate from the URL string
response = requests.get(url)
baseball_stats = pd.read_csv(io.StringIO(response.content.decode('utf-8')))

pprint.pprint(baseball_stats.head())

  Team League  Year   RS   RA   W   OBP   SLG    BA  Playoffs  RankSeason  \
0  ARI     NL  2012  734  688  81  0.33  0.42  0.26         0         NaN   
1  ATL     NL  2012  700  600  94  0.32  0.39  0.25         1         4.0   
2  BAL     AL  2012  712  705  93  0.31  0.42  0.25         1         5.0   
3  BOS     AL  2012  734  806  69  0.32  0.41  0.26         0         NaN   
4  CHC     NL  2012  613  759  61  0.30  0.38  0.24         0         NaN   

   RankPlayoffs    G  OOBP  OSLG  
0           NaN  162  0.32  0.41  
1           5.0  162  0.31  0.38  
2           4.0  162  0.32  0.40  
3           NaN  162  0.33  0.43  
4           NaN  162  0.34  0.42

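III. Conditional Row Extraction

Suppose you have finally landed a job as a Python data analyst with the Boston Red Sox (BOS). The general manager asks to see the difference between runs scored and runs allowed for 2008 through 2010. How would you go about it?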

bos_df = baseball_stats[baseball_stats.Team == "BOS"].reset_index(drop = True)
pprint.pprint(bos_df.head())

  Team League  Year   RS   RA   W   OBP   SLG    BA  Playoffs  RankSeason  \
0  BOS     AL  2012  734  806  69  0.32  0.41  0.26         0         NaN   
1  BOS     AL  2011  875  737  90  0.35  0.46  0.28         0         NaN   
2  BOS     AL  2010  818  744  89  0.34  0.45  0.27         0         NaN   
3  BOS     AL  2009  872  736  95  0.35  0.45  0.27         1         3.0   
4  BOS     AL  2008  845  694  95  0.36  0.45  0.28         1         3.0   

   RankPlayoffs    G  OOBP  OSLG  
0           NaN  162  0.33  0.43  
1           NaN  162  0.32  0.39  
2           NaN  162  0.33  0.40  
3           4.0  162  0.34  0.42  
4           3.0  162  0.32  0.39

One thing worth noting here is .reset_index(drop = True), which discards the existing row index and renumbers the rows starting from 0. As the table above shows, only the rows where Team is BOS have been extracted.
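The effect of reset_index is easiest to see on a tiny example. The sketch below uses a hypothetical four-row table (not the real dataset) and compares the index labels before and after the call, as well as what happens when drop=True is omitted.

```python
import pandas as pd

# Hypothetical mini table; the rows are for illustration only.
teams = pd.DataFrame({
    "Team": ["ARI", "BOS", "CHC", "BOS"],
    "Year": [2012, 2012, 2012, 2011],
})

filtered = teams[teams.Team == "BOS"]
print(list(filtered.index))      # [1, 3]  original labels survive the filter

renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))    # [0, 1]  labels restart at 0

# Without drop=True, the old labels are kept as a new "index" column.
kept = filtered.reset_index()
print(list(kept.columns))        # ['index', 'Team', 'Year']
```

Renumbering matters later in this post because row.Index in itertuples reports whatever labels the DataFrame currently has.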

Next, keep only the rows whose Year is 2008, 2009, or 2010.

bos_year_df = bos_df[bos_df["Year"].isin([2008, 2009, 2010])].reset_index(drop = True)
pprint.pprint(bos_year_df.head())

  Team League  Year   RS   RA   W   OBP   SLG    BA  Playoffs  RankSeason  \
0  BOS     AL  2010  818  744  89  0.34  0.45  0.27         0         NaN   
1  BOS     AL  2009  872  736  95  0.35  0.45  0.27         1         3.0   
2  BOS     AL  2008  845  694  95  0.36  0.45  0.28         1         3.0   

   RankPlayoffs    G  OOBP  OSLG  
0           NaN  162  0.33  0.40  
1           4.0  162  0.34  0.42  
2           3.0  162  0.32  0.39

The function introduced here is .isin(), which makes it easy to extract rows as long as you know the values you want. R users will recall the %in% operator, which works in much the same way.
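As a rough sketch of how .isin() relates to other filters, the example below builds a hypothetical six-row Year column and selects the same three seasons in three ways. Series.between is offered only as an alternative for contiguous ranges; the original post uses .isin().

```python
import pandas as pd

# Hypothetical Year column, for illustration only.
seasons = pd.DataFrame({"Year": [2012, 2011, 2010, 2009, 2008, 2007]})

# Membership test against a list of wanted values.
by_isin = seasons[seasons["Year"].isin([2008, 2009, 2010])]

# The same filter written as chained comparisons.
by_or = seasons[
    (seasons["Year"] == 2008) | (seasons["Year"] == 2009) | (seasons["Year"] == 2010)
]

# For a contiguous range, between() includes both endpoints by default.
by_between = seasons[seasons["Year"].between(2008, 2010)]

# Prefix with ~ to keep everything except the listed values.
not_in = seasons[~seasons["Year"].isin([2008, 2009, 2010])]

print(by_isin["Year"].tolist())
print(not_in["Year"].tolist())
```

The .isin() form stays readable as the list of values grows, and it also handles values that do not form a continuous range.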

IV. The Structure of itertuples

The structure of what itertuples returns looks like this.

for row in bos_year_df.itertuples():
  print(row)

Pandas(Index=0, Team='BOS', League='AL', Year=2010, RS=818, RA=744, W=89, OBP=0.34, SLG=0.45, BA=0.27, Playoffs=0, RankSeason=nan, RankPlayoffs=nan, G=162, OOBP=0.33, OSLG=0.4)
Pandas(Index=1, Team='BOS', League='AL', Year=2009, RS=872, RA=736, W=95, OBP=0.35, SLG=0.45, BA=0.27, Playoffs=1, RankSeason=3.0, RankPlayoffs=4.0, G=162, OOBP=0.34, OSLG=0.42)
Pandas(Index=2, Team='BOS', League='AL', Year=2008, RS=845, RA=694, W=95, OBP=0.36, SLG=0.45, BA=0.28, Playoffs=1, RankSeason=3.0, RankPlayoffs=3.0, G=162, OOBP=0.32, OSLG=0.39)

Whereas iterrows returns each row as a Series, itertuples returns each row as a namedtuple, which prints as Pandas(...). Taking advantage of this, we next write a for loop that pulls Index, Year, G, W, and Playoffs out of each row.

for row in bos_year_df.itertuples():
  i = row.Index
  year = row.Year
  games = row.G
  wins = row.W
  playoffs = row.Playoffs
  print(i, year, games, wins, playoffs)

0 2010 162 89 0
1 2009 162 95 1
2 2008 162 95 1
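The namedtuple that itertuples produces can be adjusted through its index and name parameters. The sketch below uses a hypothetical two-row frame, and the name "Season" is an arbitrary choice made for this illustration.

```python
import pandas as pd

# Hypothetical two-row frame; values mirror the 2010 and 2009 seasons above.
seasons = pd.DataFrame({"Year": [2010, 2009], "W": [89, 95], "Playoffs": [0, 1]})

# index=False drops the Index field; name sets the namedtuple's class name.
first = next(seasons.itertuples(index=False, name="Season"))
print(first)            # Season(Year=2010, W=89, Playoffs=0)
print(first._fields)    # ('Year', 'W', 'Playoffs')

# A namedtuple also unpacks like an ordinary tuple.
year, wins, playoffs = first

# name=None returns plain tuples instead of namedtuples.
plain = next(seasons.itertuples(index=False, name=None))
print(plain)
```

Attribute access such as row.Year is usually the most readable option, while plain tuples can be handy when only positional unpacking is needed.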

Now add the condition Playoffs == 1 so that only the matching rows are printed. Here 1 means the team reached the playoffs and 0 means it did not.

for row in bos_year_df.itertuples():
  i = row.Index
  year = row.Year
  games = row.G
  wins = row.W
  playoffs = row.Playoffs

  if row.Playoffs == 1:
    print(i, year, games, wins, playoffs)

1 2009 162 95 1
2 2008 162 95 1
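The same condition can be expressed more compactly. The following sketch rebuilds the three seasons shown above as a small stand-alone frame (an assumption made so the example runs on its own) and compares a list comprehension over itertuples with a boolean mask.

```python
import pandas as pd

# Stand-alone copy of the three seasons printed above.
seasons = pd.DataFrame({
    "Year": [2010, 2009, 2008],
    "G": [162, 162, 162],
    "W": [89, 95, 95],
    "Playoffs": [0, 1, 1],
})

# Loop form: collect the rows that satisfy the condition.
playoff_rows = [
    (row.Index, row.Year, row.W)
    for row in seasons.itertuples()
    if row.Playoffs == 1
]
print(playoff_rows)

# Mask form: let pandas do the filtering without an explicit loop.
playoff_df = seasons[seasons["Playoffs"] == 1]
print(playoff_df["Year"].tolist())
```

When the goal is only to select rows, the mask form is normally the simpler choice; the loop form is useful when each row needs further custom processing.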

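V. Calculating the Run Difference with itertuples

This time we compute the difference between runs scored and runs allowed, then add the result back to the existing data as a new column.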

# Define the helper function
def calc_diff(runs_scored, runs_allowed): # runs_scored: runs the team scored (RS) / runs_allowed: runs it gave up (RA)
    run_diff = runs_scored - runs_allowed
    return run_diff

run_diffs = []
for row in bos_year_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    
    run_diff = calc_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

bos_year_df["RD"] = run_diffs
pprint.pprint(bos_year_df.head())

  Team League  Year   RS   RA   W   OBP   SLG    BA  Playoffs  RankSeason  \
0  BOS     AL  2010  818  744  89  0.34  0.45  0.27         0         NaN   
1  BOS     AL  2009  872  736  95  0.35  0.45  0.27         1         3.0   
2  BOS     AL  2008  845  694  95  0.36  0.45  0.28         1         3.0   

   RankPlayoffs    G  OOBP  OSLG   RD  
0           NaN  162  0.33  0.40   74  
1           4.0  162  0.34  0.42  136  
2           3.0  162  0.32  0.39  151
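For a simple subtraction like this, a column-wise expression gives the same RD values without a loop. The sketch below copies the RS and RA values of the three seasons shown above into a small stand-alone frame so it can run on its own; it is an alternative for comparison, not the approach this post is demonstrating.

```python
import pandas as pd

# RS and RA values copied from the three Boston seasons shown above.
bos = pd.DataFrame({
    "Year": [2010, 2009, 2008],
    "RS": [818, 872, 845],
    "RA": [744, 736, 694],
})

# Loop form, condensed into a list comprehension over itertuples.
loop_rd = [row.RS - row.RA for row in bos.itertuples()]

# Vectorized form: subtract the two columns directly.
bos["RD"] = bos["RS"] - bos["RA"]

print(loop_rd)
print(bos["RD"].tolist())
```

The itertuples loop remains useful when the per-row logic is too irregular to express as column arithmetic.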

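VI. Speed Comparison: iterrows vs itertuples

Now let's write code that compares the speed of iterrows and itertuples, to help decide which one to use in the future.

First, we increase the number of observations a little by using every Boston season in the data.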

bos_df = baseball_stats[baseball_stats.Team == "BOS"].reset_index(drop = True)

(1) Speed of iterrows

%%timeit

# Define the helper function
def calc_diff(runs_scored, runs_allowed): # runs_scored: runs the team scored (RS) / runs_allowed: runs it gave up (RA)
    run_diff = runs_scored - runs_allowed
    return run_diff

run_diffs = []
for i,row in bos_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    run_diff = calc_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

bos_df["RD"] = run_diffs

6.47 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

(2) Speed of itertuples

%%timeit

# Define the helper function
def calc_diff(runs_scored, runs_allowed): # runs_scored: runs the team scored (RS) / runs_allowed: runs it gave up (RA)
    run_diff = runs_scored - runs_allowed
    return run_diff

run_diffs = []
for row in bos_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    
    run_diff = calc_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

bos_df["RD"] = run_diffs

1.57 ms ± 37.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As the timings show, itertuples took about 4.9 ms less per loop than iterrows (1.57 ms versus 6.47 ms), which makes it roughly four times faster on this data.
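The %%timeit magic only works in IPython or Jupyter. As a rough sketch of how a similar comparison could be run in a plain Python script, the example below uses the standard timeit module on a hypothetical 50-row frame of synthetic values. The function names and row count are assumptions for illustration, and the timings will differ from the figures above depending on the machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic stand-in for bos_df; the values are not real statistics.
n_rows = 50
df = pd.DataFrame({
    "RS": range(700, 700 + n_rows),
    "RA": range(650, 650 + n_rows),
})


def with_iterrows(frame):
    """Run difference per row using iterrows()."""
    return [row["RS"] - row["RA"] for _, row in frame.iterrows()]


def with_itertuples(frame):
    """Run difference per row using itertuples()."""
    return [row.RS - row.RA for row in frame.itertuples()]


repeats = 20
t_rows = timeit.timeit(lambda: with_iterrows(df), number=repeats)
t_tuples = timeit.timeit(lambda: with_itertuples(df), number=repeats)

print(f"iterrows:   {t_rows / repeats * 1e3:.3f} ms per loop")
print(f"itertuples: {t_tuples / repeats * 1e3:.3f} ms per loop")
```

Both functions return the same list, so any difference in the printed times comes from the iteration method alone.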

Next time, we will look at the apply function.

VII. Reference

pandas.DataFrame.itertuples. (n.d.). Retrieved March 22, 2020, from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.itertuples.html
