Using the Pandas apply Function with Lambda

2020-03-23

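I. Review of Iterrows and Itertuples

This post covers apply, a function that serves as an alternative to the for-loop. Before working through it, I recommend reading the earlier posts "Using the Pandas Iterrows Function" and "Using the Pandas Itertuples Function".

As in the previous posts, we use the same dataset.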

import pandas as pd
import io
import requests
import pprint

url = 'https://raw.githubusercontent.com/chloevan/datasets/master/sports/baseball_stats.csv'
content = requests.get(url).content
baseball_stats = pd.read_csv(io.StringIO(content.decode('utf-8')))

pprint.pprint(baseball_stats.head())

  Team League  Year   RS   RA   W   OBP   SLG    BA  Playoffs  RankSeason  \
0  ARI     NL  2012  734  688  81  0.33  0.42  0.26         0         NaN   
1  ATL     NL  2012  700  600  94  0.32  0.39  0.25         1         4.0   
2  BAL     AL  2012  712  705  93  0.31  0.42  0.25         1         5.0   
3  BOS     AL  2012  734  806  69  0.32  0.41  0.26         0         NaN   
4  CHC     NL  2012  613  759  61  0.30  0.38  0.24         0         NaN   

   RankPlayoffs    G  OOBP  OSLG  
0           NaN  162  0.32  0.41  
1           5.0  162  0.31  0.38  
2           4.0  162  0.32  0.40  
3           NaN  162  0.33  0.43  
4           NaN  162  0.34  0.42

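II. Conditional Row Selection

Suppose you have finally landed a job as a Python data analyst for the Boston Red Sox (BOS) baseball team. The general manager asks to see the difference between runs scored and runs allowed for 2009 through 2012. What should you do?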

bos_df = baseball_stats[baseball_stats.Team == "BOS"].reset_index(drop = True)
pprint.pprint(bos_df.head())

  Team League  Year   RS   RA   W   OBP   SLG    BA  Playoffs  RankSeason  \
0  BOS     AL  2012  734  806  69  0.32  0.41  0.26         0         NaN   
1  BOS     AL  2011  875  737  90  0.35  0.46  0.28         0         NaN   
2  BOS     AL  2010  818  744  89  0.34  0.45  0.27         0         NaN   
3  BOS     AL  2009  872  736  95  0.35  0.45  0.27         1         3.0   
4  BOS     AL  2008  845  694  95  0.36  0.45  0.28         1         3.0   

   RankPlayoffs    G  OOBP  OSLG  
0           NaN  162  0.33  0.43  
1           NaN  162  0.32  0.39  
2           NaN  162  0.33  0.40  
3           4.0  162  0.34  0.42  
4           3.0  162  0.32  0.39

One important piece here is .reset_index(drop = True), which discards the existing row index and renumbers the rows from 0. As the table above shows, only the rows where Team is BOS were extracted.
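To see what reset_index(drop=True) changes, here is a minimal sketch on a toy DataFrame. The toy rows are made up for illustration and are not the real baseball_stats data.

```python
import pandas as pd

# Toy data standing in for baseball_stats (values are illustrative only)
toy = pd.DataFrame({
    "Team": ["ARI", "BOS", "ATL", "BOS"],
    "Year": [2012, 2012, 2011, 2011],
})

kept = toy[toy["Team"] == "BOS"]          # filtering keeps the old labels 1 and 3
renumbered = kept.reset_index(drop=True)  # labels restart at 0; old index is discarded

print(list(kept.index))        # [1, 3]
print(list(renumbered.index))  # [0, 1]
```

Without drop=True, the old index would be kept as a new column named index instead of being thrown away.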

Next, extract only the 2009 to 2012 rows using the Year column.

bos_year_df = bos_df[bos_df["Year"].isin([2009, 2010, 2011, 2012])].reset_index(drop = True)
pprint.pprint(bos_year_df.head())

  Team League  Year   RS   RA   W   OBP   SLG    BA  Playoffs  RankSeason  \
0  BOS     AL  2012  734  806  69  0.32  0.41  0.26         0         NaN   
1  BOS     AL  2011  875  737  90  0.35  0.46  0.28         0         NaN   
2  BOS     AL  2010  818  744  89  0.34  0.45  0.27         0         NaN   
3  BOS     AL  2009  872  736  95  0.35  0.45  0.27         1         3.0   

   RankPlayoffs    G  OOBP  OSLG  
0           NaN  162  0.33  0.43  
1           NaN  162  0.32  0.39  
2           NaN  162  0.33  0.40  
3           4.0  162  0.34  0.42

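The function introduced here is .isin(). Its advantage is that rows are easy to extract as long as you know the values you want. R users will remember the %in% operator, which works in a very similar way.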
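As a small illustration, the sketch below shows that .isin() builds the same boolean mask as chaining == comparisons with |, and that the mask can be negated with ~. The toy Year column is an assumption made for the example.

```python
import pandas as pd

toy = pd.DataFrame({"Year": [2008, 2009, 2010, 2011, 2012]})
wanted = [2009, 2010]

mask_isin = toy["Year"].isin(wanted)                        # True where Year is in the list
mask_or = (toy["Year"] == 2009) | (toy["Year"] == 2010)     # the long-hand equivalent

print(mask_isin.equals(mask_or))            # True
print(toy[mask_isin]["Year"].tolist())      # [2009, 2010]
print(toy[~mask_isin]["Year"].tolist())     # [2008, 2011, 2012]
```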

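III. The apply Function

apply always takes a function as its argument and applies that function to the DataFrame.
The axis argument controls the direction: with axis=0 the function is applied to each column, and with axis=1 it is applied to each row.
Lambda functions are often used together with apply.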
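The difference between the two axis values is easiest to see on a tiny table. This is a minimal sketch; the DataFrame toy and its column names a and b are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = toy.apply(sum, axis=0)  # function receives each column: one result per column
row_totals = toy.apply(sum, axis=1)  # function receives each row: one result per row

# A lambda works the same way; here it receives one column at a time
spread = toy.apply(lambda col: col.max() - col.min(), axis=0)

print(col_totals.tolist())  # [6, 60]
print(row_totals.tolist())  # [11, 22, 33]
print(spread.tolist())      # [2, 20]
```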

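(1) Applying to Columns

First, let's compute the total of each column. Since we want column totals, we need the sum function and pass axis=0. One caveat: the data must be prepared beforehand so that each column's data type is compatible with the function being applied. Let's start by applying sum to every column. This raises an error, but seeing how it fails is instructive.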

stat_totals = bos_year_df.apply(sum, axis=0)
print(stat_totals)

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-38-04a01c9fea4a> in <module>
----> 1 stat_totals = bos_year_df.apply(sum, axis=0)
      2 print(stat_totals)


/usr/local/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   6876             kwds=kwds,
   6877         )
-> 6878         return op.get_result()
   6879 
   6880     def applymap(self, func) -> "DataFrame":


/usr/local/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
    184             return self.apply_raw()
    185 
--> 186         return self.apply_standard()
    187 
    188     def apply_empty_result(self):


/usr/local/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
    294             try:
    295                 result = libreduction.compute_reduction(
--> 296                     values, self.f, axis=self.axis, dummy=dummy, labels=labels
    297                 )
    298             except ValueError as err:


pandas/_libs/reduction.pyx in pandas._libs.reduction.compute_reduction()


pandas/_libs/reduction.pyx in pandas._libs.reduction.Reducer.get_result()


TypeError: unsupported operand type(s) for +: 'int' and 'str'

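The error is unsupported operand type(s) for +: 'int' and 'str': the built-in sum cannot add the values of a string column such as Team to its integer starting value. So we extract only numeric columns. The columns to extract are RS, RA, W, and Playoffs.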

bos_year_num_df = bos_year_df[['RS', 'RA', 'W', 'Playoffs']]
pprint.pprint(bos_year_num_df.head())

    RS   RA   W  Playoffs
0  734  806  69         0
1  875  737  90         0
2  818  744  89         0
3  872  736  95         1

stat_totals = bos_year_num_df.apply(sum, axis=0)
print(stat_totals)

RS          3299
RA          3023
W            343
Playoffs       1
dtype: int64

As shown, the total of each column has been computed.
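Instead of listing numeric columns by hand, one option is to let pandas pick them with select_dtypes. The sketch below is an alternative to the approach above, shown on a made-up two-row table rather than the real data.

```python
import pandas as pd

# Toy table with one string column and two numeric columns (illustrative values)
toy = pd.DataFrame({
    "Team": ["BOS", "BOS"],
    "RS": [734, 875],
    "BA": [0.26, 0.28],
})

numeric_only = toy.select_dtypes(include="number")  # drops the string column Team
totals = numeric_only.apply(sum, axis=0)            # no TypeError this time

print(list(numeric_only.columns))  # ['RS', 'BA']
print(totals["RS"])
```

On the real dataset this would also keep columns such as Year, where a total is not meaningful, so choosing the columns explicitly, as done above, can still be the better choice.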

(2) Applying to Rows

This time we write code that adds up only RS and RA for each row. The key point is to pass axis=1.

total_runs = bos_year_num_df[['RS', 'RA']].apply(sum, axis=1)
print(total_runs)

0    1540
1    1612
2    1562
3    1608
dtype: int64

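The summed result for each row is returned.

Next, we change the return value depending on the value of Playoffs. The pattern below comes up often in data processing, so it is worth learning well.

First, write the conditional function.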

def text_playoffs(num_playoffs): 
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No'

Because each row must be checked for 1 or 0 before the text is chosen, write the call in the form .apply(lambda row: function(row["name_of_column"]), axis=1).

convert_playoffs = bos_year_num_df.apply(lambda row: text_playoffs(row['Playoffs']), axis=1)
print(convert_playoffs)

0     No
1     No
2     No
3    Yes
dtype: object

You can see that the numeric values have been converted to strings.
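When the function only needs one column, the same conversion can also be written against that column directly. The sketch below compares the row-wise form with Series.apply and Series.map on a toy Playoffs column; the toy data and the variable names are assumptions for illustration.

```python
import pandas as pd

def text_playoffs(num_playoffs):
    # Same rule as above: 1 means the team made the playoffs
    return 'Yes' if num_playoffs == 1 else 'No'

toy = pd.DataFrame({"Playoffs": [0, 0, 0, 1]})

by_row = toy.apply(lambda row: text_playoffs(row["Playoffs"]), axis=1)  # row-wise form
by_column = toy["Playoffs"].apply(text_playoffs)                        # apply on the column itself
by_map = toy["Playoffs"].map({1: "Yes", 0: "No"})                       # lookup table instead of a function

print(by_row.tolist())     # ['No', 'No', 'No', 'Yes']
print(by_column.tolist())  # ['No', 'No', 'No', 'Yes']
print(by_map.tolist())     # ['No', 'No', 'No', 'Yes']
```

The row-wise form remains the one to reach for when the function needs values from several columns at once.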

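IV. Computing Season Win Percentage with apply

This time we write a win-percentage function (wp_cal), compute the team's win percentage for each season, and add the result to the existing DataFrame (bos_year_df).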

# Define the function
import numpy as np

def wp_cal(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

win_percs = bos_year_df.apply(lambda row: wp_cal(row['W'], row['G']), axis=1)
print(win_percs, '\n')

0    0.43
1    0.56
2    0.55
3    0.59
dtype: float64

# Add the `WP` column to bos_year_df
bos_year_df['WP'] = win_percs

# Select the seasons with a win percentage of 0.5 or lower
print(bos_year_df[bos_year_df['WP'] <= 0.50])

  Team League  Year   RS   RA   W   OBP   SLG    BA  Playoffs  RankSeason  \
0  BOS     AL  2012  734  806  69  0.32  0.41  0.26         0         NaN   

   RankPlayoffs    G  OOBP  OSLG    WP  
0           NaN  162  0.33  0.43  0.43
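For a calculation this simple, the same column can also be produced without apply by dividing the columns directly. The sketch below uses the W and G values from the four BOS rows shown earlier; the names toy, wp_apply, and wp_vectorized are made up for the example.

```python
import numpy as np
import pandas as pd

def wp_cal(wins, games_played):
    # Same win-percentage function as above
    return np.round(wins / games_played, 2)

# W and G for BOS, 2012 down to 2009, copied from the table above
toy = pd.DataFrame({"W": [69, 90, 89, 95], "G": [162, 162, 162, 162]})

wp_apply = toy.apply(lambda row: wp_cal(row["W"], row["G"]), axis=1)  # row by row
wp_vectorized = (toy["W"] / toy["G"]).round(2)                        # whole columns at once

print(wp_vectorized.tolist())  # [0.43, 0.56, 0.55, 0.59]
```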

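V. Speed Comparison: iterrows vs. itertuples vs. apply

Now let's write code that compares the speed of iterrows, itertuples, and apply, to help decide which construct to use in the future.

First, we increase the number of observations a little by using every BOS row.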

bos_df = baseball_stats[baseball_stats.Team == "BOS"].reset_index(drop = True)

(1) Speed of iterrows

%%timeit

# Define the function
def calc_diff(runs_scored, runs_allowed): # runs_scored: runs the team scored / runs_allowed: runs the team gave up
    run_diff = runs_scored - runs_allowed
    return run_diff

run_diffs = []
for i,row in bos_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    run_diff = calc_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

bos_df["RD"] = run_diffs

6.51 ms ± 413 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

(2) Speed of itertuples

%%timeit

# Define the function
def calc_diff(runs_scored, runs_allowed): # runs_scored: runs the team scored / runs_allowed: runs the team gave up
    run_diff = runs_scored - runs_allowed
    return run_diff

run_diffs = []
for row in bos_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    
    run_diff = calc_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

bos_df["RD"] = run_diffs

1.71 ms ± 99.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(3) Speed of apply

%%timeit

# Define the function
def calc_diff(runs_scored, runs_allowed): # runs_scored: runs the team scored / runs_allowed: runs the team gave up
    run_diff = runs_scored - runs_allowed
    return run_diff

run_diffs_apply = bos_df.apply(lambda row: calc_diff(row['RS'], row['RA']),axis=1)
bos_df['RD'] = run_diffs_apply

2.6 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

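As the timings show, apply (2.6 ms) is about 3.9 ms faster than iterrows (6.51 ms). It is not faster than itertuples, however, which ran in 1.71 ms.

Honestly, this result was a little unexpected, and I am not sure how to interpret it. What is clear is that apply is faster than iterrows and makes the code much more concise than itertuples, and that is what makes apply attractive.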
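For a plain subtraction like the run difference, there is also the option of operating on whole columns without any of the three constructs. The sketch below times it against apply with the standard timeit module instead of the %%timeit notebook magic. The five RS and RA values are the BOS rows shown earlier, while the helper names rd_apply and rd_vectorized are made up for the example. Timings vary by machine and pandas version, so none are quoted here.

```python
import timeit
import pandas as pd

# RS and RA for BOS, 2012 down to 2008, copied from the table above
toy = pd.DataFrame({"RS": [734, 875, 818, 872, 845],
                    "RA": [806, 737, 744, 736, 694]})

def rd_apply():
    # Row-by-row run difference, as in section (3)
    return toy.apply(lambda row: row["RS"] - row["RA"], axis=1)

def rd_vectorized():
    # Column-wise subtraction, no Python-level loop over rows
    return toy["RS"] - toy["RA"]

# Both approaches must agree before their speed is compared
assert rd_apply().tolist() == rd_vectorized().tolist()

for fn in (rd_apply, rd_vectorized):
    seconds = timeit.timeit(fn, number=200)
    print(f"{fn.__name__}: {seconds / 200 * 1e3:.3f} ms per call")
```

Column-wise arithmetic tends to be the faster of the two, but it only covers operations that pandas can express on whole columns, so it is worth measuring on your own data.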

VI. Reference

pandas.DataFrame.apply. (n.d.). Retrieved March 23, 2020, from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
