I. Getting the Titanic Data from Kaggle
!pip install kaggle
Requirement already satisfied: kaggle in /usr/local/lib/python3.6/dist-packages (1.5.6)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.12.0)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.8.1)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from kaggle) (4.41.1)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.6/dist-packages (from kaggle) (4.0.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.6/dist-packages (from kaggle) (2020.6.20)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.6/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (2.9)
- Upload the Kaggle API key file (kaggle.json) to authorize access.
from google.colab import files
files.upload()
Saving kaggle.json to kaggle.json
{'kaggle.json': b'{"username":"j2hoon85","key":"5a23c8dba5a151100b483a587eafdac8"}'}
!mkdir -p ~/.kaggle # create the directory
!mv kaggle.json ~/.kaggle/ # move kaggle.json into it
!chmod 600 ~/.kaggle/kaggle.json # restrict the file to owner read/write
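The three shell commands above can also be written in plain Python, which may be handy outside a notebook. This is only a sketch: `install_kaggle_key` and its `home` parameter are hypothetical names introduced here for illustration and are not part of the kaggle package.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_key(src, home=None):
    """Move a kaggle.json file into <home>/.kaggle and restrict it to the owner.

    Hypothetical helper for illustration; `home` defaults to the user's home directory.
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like: mkdir -p ~/.kaggle
    target = target_dir / "kaggle.json"
    shutil.move(str(src), str(target))             # like: mv kaggle.json ~/.kaggle/
    os.chmod(target, 0o600)                        # like: chmod 600 ~/.kaggle/kaggle.json
    return target
```

Calling `install_kaggle_key('kaggle.json')` after the upload step would leave the key at ~/.kaggle/kaggle.json, readable and writable by the owner only.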
!kaggle competitions list
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.6 / client 1.5.4)
ref deadline category reward teamCount userHasEntered
--------------------------------------------- ------------------- --------------- --------- --------- --------------
tpu-getting-started 2030-06-03 23:59:00 Getting Started Kudos 125 False
digit-recognizer 2030-01-01 00:00:00 Getting Started Knowledge 2958 False
titanic 2030-01-01 00:00:00 Getting Started Knowledge 22881 True
house-prices-advanced-regression-techniques 2030-01-01 00:00:00 Getting Started Knowledge 4985 True
connectx 2030-01-01 00:00:00 Getting Started Knowledge 673 False
nlp-getting-started 2030-01-01 00:00:00 Getting Started Kudos 1455 True
competitive-data-science-predict-future-sales 2020-12-31 23:59:00 Playground Kudos 7626 False
halite 2020-09-15 23:59:00 Featured Swag 534 False
birdsong-recognition 2020-09-15 23:59:00 Research $25,000 244 False
landmark-retrieval-2020 2020-08-17 23:59:00 Research $25,000 53 False
siim-isic-melanoma-classification 2020-08-17 23:59:00 Featured $30,000 1672 False
global-wheat-detection 2020-08-04 23:59:00 Research $15,000 1353 False
open-images-object-detection-rvc-2020 2020-07-31 16:00:00 Playground Knowledge 45 False
open-images-instance-segmentation-rvc-2020 2020-07-31 16:00:00 Playground Knowledge 9 False
hashcode-photo-slideshow 2020-07-27 23:59:00 Playground Knowledge 50 False
prostate-cancer-grade-assessment 2020-07-22 23:59:00 Featured $25,000 765 False
alaska2-image-steganalysis 2020-07-20 23:59:00 Research $25,000 869 False
m5-forecasting-accuracy 2020-06-30 23:59:00 Featured $50,000 5558 True
m5-forecasting-uncertainty 2020-06-30 23:59:00 Featured $50,000 909 False
trends-assessment-prediction 2020-06-29 23:59:00 Research $25,000 1047 False
!kaggle competitions download -c titanic
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.6 / client 1.5.4)
gender_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
train.csv: Skipping, found more recently modified local copy (use --force to force download)
chloevan_key.pem gender_submission.csv sample_data test.csv train.csv
import pandas as pd
titanic_df = pd.read_csv(r'train.csv')
titanic_df.head(3)
print('titanic variable type:', type(titanic_df))
titanic variable type: <class 'pandas.core.frame.DataFrame'>
II. Key Functions for Data Handling
- This section introduces a few key functions for data handling.
(1) value_counts()
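value_counts() returns the distinct values in a column together with the number of rows for each value. The result is a Series indexed by those values and sorted by count in descending order.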
val_count = titanic_df['Embarked'].value_counts()
print(type(val_count))
print(val_count)
<class 'pandas.core.series.Series'>
S 644
C 168
Q 77
Name: Embarked, dtype: int64
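Note that value_counts() skips missing values by default, so the counts above need not add up to the number of rows. The sketch below shows the `dropna` and `normalize` options on a small made-up Series (not the Titanic data), to keep the example self-contained.

```python
import pandas as pd

# Toy stand-in for an 'Embarked'-like column; the values are made up.
embarked = pd.Series(['S', 'C', 'S', None, 'Q', 'S'], name='Embarked')

counts = embarked.value_counts()                       # missing values are excluded by default
counts_with_nan = embarked.value_counts(dropna=False)  # missing values get their own entry
shares = embarked.value_counts(normalize=True)         # relative frequencies instead of counts

print(counts)
print(counts_with_nan)
print(shares)
```

Comparing `counts.sum()` with `len(embarked)` is a quick way to see how many values are missing in a column.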
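(2) Dropping Part of a DataFrame
drop() deletes rows or columns, depending on the axis argument.
- The main parameters are labels, inplace, and axis.
- labels: the column name(s) or row index label(s) to drop
- inplace: whether to modify the original DataFrame in place (the default, False, returns a new DataFrame)
- axis: 0 drops rows, 1 drops columns
First, drop a column by passing axis=1.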
data = titanic_df.copy()
data_drop = data.drop(labels = 'Age', axis=1)
data_drop.head()
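To make the three parameters concrete, here is a minimal sketch on a small made-up DataFrame. The column names echo the Titanic data, but the values are invented for illustration.

```python
import pandas as pd

# Toy frame; the values are made up.
df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Age': [22, 38, 26],
    'Fare': [7.25, 71.28, 7.92],
})

no_age = df.drop(labels='Age', axis=1)     # returns a new frame; df itself is unchanged
no_first_row = df.drop(labels=0, axis=0)   # drops the row whose index label is 0

df_inplace = df.copy()
# With inplace=True the frame is modified directly and drop() returns None.
result = df_inplace.drop(labels=['Age', 'Fare'], axis=1, inplace=True)

print(no_age.columns.tolist())
print(no_first_row.index.tolist())
print(result, df_inplace.columns.tolist())
```

Because `inplace=True` returns None, assigning its result back to the same variable (for example `df = df.drop(..., inplace=True)`) would discard the DataFrame, which is a common mistake.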