Pandas Speed Comparison - loc vs replace (2)
One-line summary
- When changing values, use the .replace method.
Overview
- Let's measure the speed of replace.
- This time, we look at how to change multiple values at once.
Comparison 1. .loc vs .replace
- One way to change a value with .loc is as follows.
data['column'].loc[data['column'] == 'Old Value'] = 'New Value'
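As a small illustration of the pattern above, here is a minimal sketch on a toy DataFrame. The frame `data` and its values are made up for the example, and it uses the single-step form `data.loc[mask, 'column']`, which is a substitute for the two-step `data['column'].loc[mask]` form, not the code the post timed.

```python
import pandas as pd

# Toy data (hypothetical values, only for illustration).
data = pd.DataFrame({'column': ['Old Value', 'Other', 'Old Value']})

# Select rows with a boolean mask and the target column in one .loc call,
# then assign the new value to exactly those cells.
data.loc[data['column'] == 'Old Value', 'column'] = 'New Value'

print(data['column'].tolist())
# ['New Value', 'Other', 'New Value']
```

Passing the row mask and the column label together means the assignment goes straight to the original DataFrame.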
import pandas as pd
import seaborn as sns
diamonds = sns.load_dataset('diamonds')
print(diamonds)
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64
[53940 rows x 10 columns]
diamonds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null category
2 color 53940 non-null category
3 clarity 53940 non-null category
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB
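The info() output shows that cut is stored as category dtype. The following is a rough sketch, on a made-up three-row Series, of why that matters when assigning a label such as Best that is not one of the existing categories. The exact exception type may differ between pandas versions, so the example catches both TypeError and ValueError as an assumption.

```python
import pandas as pd

# Hypothetical miniature version of the cut column, stored as category.
s = pd.Series(['Ideal', 'Premium', 'Good'], dtype='category')

try:
    # 'Best' is not one of the existing categories, so pandas refuses.
    s.loc[s == 'Ideal'] = 'Best'
    raised = False
except (TypeError, ValueError):
    raised = True

# After converting to object dtype, any string can be assigned.
obj = s.astype('object')
obj.loc[obj == 'Ideal'] = 'Best'

print(raised)
print(obj.tolist())
```

This is one plausible reason to call astype('object') on the column before using .loc to write new labels.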
Comparison 2. .loc vs .replace
- Among the values in the cut column, change both Premium and Ideal to Best.
import time
diamonds = sns.load_dataset('diamonds')
diamonds['cut'] = diamonds['cut'].astype('object')
start_time = time.time()
diamonds['cut'].loc[(diamonds['cut'] == 'Premium') | (diamonds['cut'] == 'Ideal')] = 'Best'
print('Time using .loc[]: {} sec'.format(time.time() - start_time))
print(diamonds)
Time using .loc[]: 0.008001089096069336 sec
carat cut color clarity depth table price x y z
0 0.23 Best E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Best E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Best I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Best D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Best H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Best D SI2 62.2 55.0 2757 5.83 5.87 3.64
[53940 rows x 10 columns]
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexing.py:1636: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
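The SettingWithCopyWarning above is raised because the assignment goes through chained indexing: diamonds['cut'] is selected first, and .loc is then applied to that result. A minimal sketch of a form that avoids the chain is shown below. It runs on a small made-up frame named `diamonds_small` (an assumption, so the example does not need to download the dataset) and uses `isin` to build the mask for several values.

```python
import pandas as pd

# Small stand-in for the diamonds data (hypothetical rows).
diamonds_small = pd.DataFrame({
    'cut': ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
    'price': [326, 326, 327, 334, 335],
})

# isin() builds one boolean mask covering both values.
mask = diamonds_small['cut'].isin(['Premium', 'Ideal'])

# Row mask and column label go into a single .loc call, so the
# assignment targets the original DataFrame rather than a selection of it.
diamonds_small.loc[mask, 'cut'] = 'Best'

print(diamonds_small['cut'].tolist())
# ['Best', 'Best', 'Good', 'Very Good', 'Fair']
```

The result is the same relabeling as in the timed code, without the warning. Its timing may differ from the numbers reported in this post.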
- This time, let's use the replace method.
data['column'].replace('old value', 'new value', inplace=True)
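For reference, a minimal sketch of the same replace call on a toy frame, written so that the result is assigned back to the column instead of relying on inplace=True. The frame and values are made up for illustration. Assigning back is a stylistic substitution: recent pandas versions may warn when an in-place method is called on a selected column.

```python
import pandas as pd

# Toy data (hypothetical values).
data = pd.DataFrame({'column': ['old value', 'keep', 'old value']})

# replace() returns a new Series; assign it back to the column.
data['column'] = data['column'].replace('old value', 'new value')

print(data['column'].tolist())
# ['new value', 'keep', 'new value']
```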
diamonds = sns.load_dataset('diamonds')
start_time = time.time()
diamonds.replace(['Premium', 'Ideal'], 'Best', inplace=True)
print('Time using replace(): {} sec'.format(time.time() - start_time))
Time using replace(): 0.0011608600616455078 sec
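A single time.time() measurement of a millisecond-scale operation can be noisy. In the runs above, the .loc version also works on an object column while the replace version works on the original category column. The sketch below is one possible way to repeat the comparison under more even conditions, using the standard-library timeit module on a synthetic 50,000-row frame. The synthetic data and the helper names `with_loc` and `with_replace` are assumptions for this example, and the timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic stand-in for the cut column (assumption: 50,000 string rows).
base = pd.DataFrame(
    {'cut': ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'] * 10_000}
)

def with_loc():
    # Copy first so every call starts from the same unmodified data.
    df = base.copy()
    df.loc[df['cut'].isin(['Premium', 'Ideal']), 'cut'] = 'Best'
    return df

def with_replace():
    df = base.copy()
    df['cut'] = df['cut'].replace(['Premium', 'Ideal'], 'Best')
    return df

# Run each function 10 times per repeat and keep the best of 3 repeats.
loc_time = min(timeit.repeat(with_loc, number=10, repeat=3))
replace_time = min(timeit.repeat(with_replace, number=10, repeat=3))

print('Time using .loc[]: {:.6f} sec per call'.format(loc_time / 10))
print('Time using replace(): {:.6f} sec per call'.format(replace_time / 10))
```

Both functions include the cost of the copy, so the numbers are only meaningful relative to each other.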
- To the previous code, add code that changes Good and Very Good to Low.
diamonds = sns.load_dataset('diamonds')
diamonds['cut'] = diamonds['cut'].astype('object')
start_time = time.time()
diamonds['cut'].loc[(diamonds['cut'] == 'Premium') | \
(diamonds['cut'] == 'Ideal')] = 'Best'
diamonds['cut'].loc[(diamonds['cut'] == 'Very Good') | \
(diamonds['cut'] == 'Good')] = 'Low'
print('Time using .loc[]: {} sec'.format(time.time() - start_time))
print(diamonds)
Time using .loc[]: 0.013423681259155273 sec
carat cut color clarity depth table price x y z
0 0.23 Best E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Best E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Low E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Best I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Low J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Best D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Low D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Low D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Best H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Best D SI2 62.2 55.0 2757 5.83 5.87 3.64
[53940 rows x 10 columns]
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexing.py:1636: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
diamonds = sns.load_dataset('diamonds')
start_time = time.time()
diamonds.replace([['Premium', 'Ideal'], ['Very Good', 'Good']], ['Best', 'Low'], inplace=True)
print('Time using replace(): {} sec'.format(time.time() - start_time))
print(diamonds)
Time using replace(): 0.002335071563720703 sec
carat cut color clarity depth table price x y z
0 0.23 Best E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Best E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Low E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Best I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Low J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Best D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Low D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Low D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Best H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Best D SI2 62.2 55.0 2757 5.83 5.87 3.64
[53940 rows x 10 columns]
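The same multi-value relabeling can also be written with a dictionary passed to replace, which some readers may find easier to scan than nested lists. This is a minimal sketch on a made-up five-row frame. The name `mapping` is introduced only for this example, and this variant was not timed in this post.

```python
import pandas as pd

# Hypothetical miniature version of the cut column.
df = pd.DataFrame({'cut': ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair']})

# Each key is an old value and each value is its replacement.
mapping = {
    'Premium': 'Best',
    'Ideal': 'Best',
    'Very Good': 'Low',
    'Good': 'Low',
}

# Values not listed in the mapping (here 'Fair') are left unchanged.
df['cut'] = df['cut'].replace(mapping)

print(df['cut'].tolist())
# ['Best', 'Best', 'Low', 'Low', 'Fair']
```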