Pandas Speed Comparison - loc vs replace (2)

2021-05-20

Python, Pandas

One-line summary

When changing values, use the `.replace` method; in the timings below it was faster than `.loc`.

Overview

Let's measure the speed of `replace`.
This time we look at how to change multiple values at once.

Comparison 1. `.loc` vs `.replace`

With `.loc`, a value is changed as follows.
- `data['column'].loc[data['column'] == 'Old Value'] = 'New Value'`

import pandas as pd
import seaborn as sns
diamonds = sns.load_dataset('diamonds')
print(diamonds)

       carat        cut color clarity  depth  table  price     x     y     z
0       0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1       0.21    Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2       0.23       Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3       0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4       0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
...      ...        ...   ...     ...    ...    ...    ...   ...   ...   ...
53935   0.72      Ideal     D     SI1   60.8   57.0   2757  5.75  5.76  3.50
53936   0.72       Good     D     SI1   63.1   55.0   2757  5.69  5.75  3.61
53937   0.70  Very Good     D     SI1   62.8   60.0   2757  5.66  5.68  3.56
53938   0.86    Premium     H     SI2   61.0   58.0   2757  6.15  6.12  3.74
53939   0.75      Ideal     D     SI2   62.2   55.0   2757  5.83  5.87  3.64

[53940 rows x 10 columns]

diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB
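The `info()` output shows that `cut` is stored as a `category` column, which is presumably why the timing code in the next section casts it to `object` before assigning a new label. The following is a minimal sketch on a small hypothetical Series (not the diamonds data) of what happens when a label outside the existing categories is assigned directly.

```python
import pandas as pd

# Hypothetical three-row stand-in for the 'cut' column, stored as category.
cut = pd.Series(['Ideal', 'Premium', 'Good'], dtype='category')

# 'Best' is not one of the existing categories, so direct assignment is rejected.
trial = cut.copy()
try:
    trial[trial == 'Ideal'] = 'Best'
    assignment_failed = False
except (TypeError, ValueError):
    assignment_failed = True

# After casting to object, any string can be assigned.
as_object = cut.astype('object')
as_object[as_object == 'Ideal'] = 'Best'

print(assignment_failed)
print(as_object.tolist())
```

The exception type has varied between pandas versions, so the sketch catches both `TypeError` and `ValueError`.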

Comparison 2. `.loc` vs `.replace`

In the `cut` column, change both Premium and Ideal to Best.

import time

diamonds = sns.load_dataset('diamonds')
diamonds['cut'] = diamonds['cut'].astype('object')

start_time = time.time()
diamonds['cut'].loc[(diamonds['cut'] == 'Premium') | (diamonds['cut'] == 'Ideal')] = 'Best'
print('Time using .loc[]: {} sec'.format(time.time() - start_time))
print(diamonds)

Time using .loc[]: 0.008001089096069336 sec
       carat        cut color clarity  depth  table  price     x     y     z
0       0.23       Best     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1       0.21       Best     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2       0.23       Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3       0.29       Best     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4       0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
...      ...        ...   ...     ...    ...    ...    ...   ...   ...   ...
53935   0.72       Best     D     SI1   60.8   57.0   2757  5.75  5.76  3.50
53936   0.72       Good     D     SI1   63.1   55.0   2757  5.69  5.75  3.61
53937   0.70  Very Good     D     SI1   62.8   60.0   2757  5.66  5.68  3.56
53938   0.86       Best     H     SI2   61.0   58.0   2757  6.15  6.12  3.74
53939   0.75       Best     D     SI2   62.2   55.0   2757  5.83  5.87  3.64

[53940 rows x 10 columns]


/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexing.py:1636: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
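The `SettingWithCopyWarning` above comes from chained indexing: `diamonds['cut']` is selected first, and `.loc` is then applied to that intermediate Series. A common way to avoid it is to pass the row mask and the column label to a single `.loc` call on the DataFrame. The sketch below uses a small hypothetical frame rather than the diamonds data, and `isin` is used here only as a shorter way to write the two `==` comparisons.

```python
import pandas as pd

# Small hypothetical frame standing in for diamonds (values chosen for illustration).
df = pd.DataFrame({
    'cut': ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
    'price': [326, 326, 327, 334, 335],
})

# The row mask and the column label go into one .loc call, so pandas
# writes into df itself instead of into an intermediate Series.
mask = df['cut'].isin(['Premium', 'Ideal'])
df.loc[mask, 'cut'] = 'Best'

print(df['cut'].tolist())  # ['Best', 'Best', 'Good', 'Very Good', 'Fair']
```

This form changes how the assignment is expressed, so its timing may differ from the figure recorded above.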

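This time, try the `replace` method.
- `data['column'].replace('old value', 'new value', inplace=True)`

Note that the call below is applied to the entire DataFrame and does not convert `cut` to `object` first, so the setup differs slightly from the `.loc` run.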

diamonds = sns.load_dataset('diamonds')

start_time = time.time()
diamonds.replace(['Premium', 'Ideal'], 'Best', inplace=True)
print('Time using replace(): {} sec'.format(time.time() - start_time))

Time using replace(): 0.0011608600616455078 sec
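A single `time.time()` difference at the millisecond scale is sensitive to noise, so repeating the measurement can give a steadier comparison. The sketch below is one way to do that with `timeit.repeat`; it runs on synthetic data rather than the diamonds dataset, and the helper names `with_loc` and `with_replace` are hypothetical. The numbers it prints depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the 'cut' column (assumption: not the real dataset).
rng = np.random.default_rng(0)
labels = ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
base = pd.DataFrame({'cut': pd.Series(rng.choice(labels, size=50_000), dtype='object')})


def with_loc(df):
    """Relabel Premium/Ideal as Best using a boolean mask and .loc."""
    out = df.copy()
    mask = out['cut'].isin(['Premium', 'Ideal'])
    out.loc[mask, 'cut'] = 'Best'
    return out


def with_replace(df):
    """Relabel Premium/Ideal as Best using Series.replace."""
    out = df.copy()
    out['cut'] = out['cut'].replace(['Premium', 'Ideal'], 'Best')
    return out


# Each measurement runs the function 5 times; keep the fastest of 3 repeats.
loc_time = min(timeit.repeat(lambda: with_loc(base), number=5, repeat=3))
replace_time = min(timeit.repeat(lambda: with_replace(base), number=5, repeat=3))
print(f'.loc:     {loc_time:.4f} sec for 5 runs')
print(f'.replace: {replace_time:.4f} sec for 5 runs')
```

Both helpers copy the input first, so every run starts from the same unmodified data and the copy cost is included equally in both timings.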

Next, extend the previous code so that Good and Very Good are also changed to Low.

diamonds = sns.load_dataset('diamonds')
diamonds['cut'] = diamonds['cut'].astype('object')

start_time = time.time()
diamonds['cut'].loc[(diamonds['cut'] == 'Premium') |
                    (diamonds['cut'] == 'Ideal')] = 'Best'
diamonds['cut'].loc[(diamonds['cut'] == 'Very Good') |
                    (diamonds['cut'] == 'Good')] = 'Low'
print('Time using .loc[]: {} sec'.format(time.time() - start_time))
print(diamonds)

Time using .loc[]: 0.013423681259155273 sec
       carat   cut color clarity  depth  table  price     x     y     z
0       0.23  Best     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1       0.21  Best     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2       0.23   Low     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3       0.29  Best     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4       0.31   Low     J     SI2   63.3   58.0    335  4.34  4.35  2.75
...      ...   ...   ...     ...    ...    ...    ...   ...   ...   ...
53935   0.72  Best     D     SI1   60.8   57.0   2757  5.75  5.76  3.50
53936   0.72   Low     D     SI1   63.1   55.0   2757  5.69  5.75  3.61
53937   0.70   Low     D     SI1   62.8   60.0   2757  5.66  5.68  3.56
53938   0.86  Best     H     SI2   61.0   58.0   2757  6.15  6.12  3.74
53939   0.75  Best     D     SI2   62.2   55.0   2757  5.83  5.87  3.64

[53940 rows x 10 columns]


/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexing.py:1636: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)

diamonds = sns.load_dataset('diamonds')

start_time = time.time()
diamonds.replace([['Premium', 'Ideal'], ['Very Good', 'Good']], ['Best', 'Low'], inplace=True)
print('Time using replace(): {} sec'.format(time.time() - start_time)) 
print(diamonds)

Time using replace(): 0.002335071563720703 sec
       carat   cut color clarity  depth  table  price     x     y     z
0       0.23  Best     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1       0.21  Best     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2       0.23   Low     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3       0.29  Best     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4       0.31   Low     J     SI2   63.3   58.0    335  4.34  4.35  2.75
...      ...   ...   ...     ...    ...    ...    ...   ...   ...   ...
53935   0.72  Best     D     SI1   60.8   57.0   2757  5.75  5.76  3.50
53936   0.72   Low     D     SI1   63.1   55.0   2757  5.69  5.75  3.61
53937   0.70   Low     D     SI1   62.8   60.0   2757  5.66  5.68  3.56
53938   0.86  Best     H     SI2   61.0   58.0   2757  6.15  6.12  3.74
53939   0.75  Best     D     SI2   62.2   55.0   2757  5.83  5.87  3.64

[53940 rows x 10 columns]
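When several old values map to several new values, `replace` also accepts a dictionary, which states each old-to-new pair explicitly. The sketch below is an alternative to the nested-list call above, not the form that was timed in this post; it uses a small hypothetical Series instead of the diamonds data.

```python
import pandas as pd

# Hypothetical stand-in for the 'cut' column, already cast to object.
cut = pd.Series(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype='object')

# Each key is an old label and each value is its replacement.
mapping = {
    'Premium': 'Best',
    'Ideal': 'Best',
    'Very Good': 'Low',
    'Good': 'Low',
}
relabeled = cut.replace(mapping)

print(relabeled.tolist())  # ['Best', 'Best', 'Low', 'Low', 'Fair']
```

Labels that do not appear in the mapping, such as Fair here, are left unchanged.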
