Basic Data Transformation

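Course Promotion

I have put together a course for job seekers.
If you enrolled in the course through this blog, please send me the title and link of the post via Inflearn message.
- I will send you a Starbucks iced Americano as a thank-you gift.
[비전공자 대환영] 제로베이스도 쉽게 입문하는 파이썬 데이터 분석 - 캐글입문기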

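Overview

A list is one of Python's basic built-in data types.
It is also a common starting point for data analysis with Pandas.
This post tries out several ways of converting a list into a Pandas DataFrame.

Method 1. Basics

Create a list, then convert it into a DataFrame.
Here, pay attention to the column and index values that appear in the output.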

import pandas as pd

lst = ["Korea", "Japan", "USA", "China", "Russia"]
data = pd.DataFrame(lst)
print(data)

        0
0   Korea
1   Japan
2     USA
3   China
4  Russia
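
To check the column and index values directly rather than reading them off the printed table, the `columns` and `index` attributes can be inspected. This is a minimal sketch that reuses the same list; since no labels are supplied, both axes receive default integer labels.

```python
import pandas as pd

lst = ["Korea", "Japan", "USA", "China", "Russia"]
data = pd.DataFrame(lst)

# With no labels given, both axes get default integer labels starting at 0.
print(data.columns)  # column labels
print(data.index)    # row labels
print(data.shape)    # (number of rows, number of columns)
```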

Method 2. Adding Column & Index

This time, pass explicit column and index labels when creating the DataFrame.

lst = ["Korea", "Japan", "USA", "China", "Russia"]
country_index = ["a", "b", "c", "d", "e"]
data = pd.DataFrame(lst, index = country_index, columns=["Country"])
print(data)

  Country
a   Korea
b   Japan
c     USA
d   China
e  Russia
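
One reason to set explicit labels is that values can then be looked up by name. The sketch below rebuilds the same DataFrame and reads a single cell, once by label with `.loc` and once by position with `.iloc`.

```python
import pandas as pd

lst = ["Korea", "Japan", "USA", "China", "Russia"]
country_index = ["a", "b", "c", "d", "e"]
data = pd.DataFrame(lst, index=country_index, columns=["Country"])

# .loc selects by label: row "c", column "Country".
by_label = data.loc["c", "Country"]

# .iloc selects by position, regardless of the custom labels.
by_position = data.iloc[2, 0]

print(by_label, by_position)
```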

Method 3. Two Lists with zip()

Before using zip() to build a DataFrame, first look at what zip() itself does.

a = ["김", "심", "홍"]
b = ["길동", "청이", "길동"]

x = zip(a, b)
print(list(x))

[('김', '길동'), ('심', '청이'), ('홍', '길동')]

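As the result above shows, zip() pairs up the elements of sequences element by element; the inputs here have the same length.
Use this to convert the two lists into a Pandas DataFrame.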

full_name = pd.DataFrame(list(zip(a, b)), 
                         columns = ["성", "이름"])

print(full_name)

   성  이름
0  김  길동
1  심  청이
2  홍  길동
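
One caveat about zip() is worth checking before building a DataFrame from it: when the inputs differ in length, zip() stops at the shortest one without raising an error. The sketch below uses a deliberately shortened second list (an illustrative assumption, not data from this post) and contrasts zip() with `itertools.zip_longest`.

```python
from itertools import zip_longest

last_names = ["김", "심", "홍"]
first_names = ["길동", "청이"]  # one element short on purpose

# zip() stops at the shortest input, so the unmatched "홍" is silently dropped.
pairs = list(zip(last_names, first_names))
print(pairs)

# zip_longest() keeps every element and fills the gaps with None by default.
padded = list(zip_longest(last_names, first_names))
print(padded)
```

If silently losing rows would be a problem, comparing `len()` of the lists before zipping is a simple safeguard.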

Method 4. Using a Dictionary

Build a DataFrame from lists combined into a dictionary.

# three lists
name = ["홍길동", "심청이", "임꺽정"]
age = [30, 40, 50]
gender = ["남성", "여성", "남성"]

# build the dictionary (named class_dict to avoid shadowing the built-in dict)
class_dict = {"성함": name, "나이": age, "성별": gender}

class_df = pd.DataFrame(class_dict)
print(class_df)

    성함  나이  성별
0  홍길동  30  남성
1  심청이  40  여성
2  임꺽정  50  남성
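
Unlike zip(), the dictionary approach does not quietly truncate: if the lists have different lengths, pandas refuses to build the DataFrame. The sketch below uses a deliberately shortened `age` list (an illustrative assumption) and captures the resulting `ValueError`.

```python
import pandas as pd

name = ["홍길동", "심청이", "임꺽정"]
age = [30, 40]  # one value missing on purpose

try:
    pd.DataFrame({"성함": name, "나이": age})
    error_message = None
except ValueError as exc:
    # pandas reports that the columns do not all have the same length.
    error_message = str(exc)

print(error_message)
```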

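Exporting to Excel

Now export the DataFrame created above to an Excel file. Writing an .xlsx file requires an Excel engine such as openpyxl to be installed.

class_df.to_excel("class_df.xlsx", sheet_name="1반", index=False)

Check that the file was actually written.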

%ls

class_df.xlsx
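
Listing the directory only shows that a file exists. A slightly stronger check is to read the file back and compare it with the original. The sketch below assumes the openpyxl engine is available and skips the check otherwise; it writes to a temporary directory so it does not touch the working folder.

```python
import importlib.util
import os
import tempfile

import pandas as pd

class_df = pd.DataFrame(
    {"성함": ["홍길동", "심청이", "임꺽정"],
     "나이": [30, 40, 50],
     "성별": ["남성", "여성", "남성"]}
)

if importlib.util.find_spec("openpyxl") is None:
    # Assumption: openpyxl is the engine used for .xlsx files.
    print("openpyxl is not installed; skipping the round-trip check")
    round_trip_ok = None
else:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "class_df.xlsx")
        class_df.to_excel(path, sheet_name="1반", index=False)
        restored = pd.read_excel(path, sheet_name="1반")
    # Compare column names and cell values of the reloaded sheet.
    round_trip_ok = (
        list(restored.columns) == list(class_df.columns)
        and restored.values.tolist() == class_df.values.tolist()
    )
    print(round_trip_ok)
```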

Reference

Shivam_k. (2018). Create a Pandas DataFrame from Lists. Retrieved from https://www.geeksforgeeks.org/create-a-pandas-dataframe-from-lists/


I. Getting the Titanic Data from Kaggle

For a fuller example of fetching Kaggle data, see the post Kaggle with Google Colab.
First, install the kaggle package.

!pip install kaggle

Requirement already satisfied: kaggle in /usr/local/lib/python3.6/dist-packages (1.5.6)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.12.0)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.8.1)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from kaggle) (4.41.1)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.6/dist-packages (from kaggle) (4.0.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.6/dist-packages (from kaggle) (2020.6.20)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.6/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (2.9)

Upload the Kaggle API token (kaggle.json) to authorize access.

from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json

{'kaggle.json': b'{"username":"j2hoon85","key":"<redacted>"}'}

!mkdir -p ~/.kaggle # create the ~/.kaggle directory
!mv kaggle.json ~/.kaggle/ # move kaggle.json into it
!chmod 600 ~/.kaggle/kaggle.json # restrict read/write access to the owner
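
The three shell commands above can also be expressed in plain Python, which may be handy outside a notebook. This is a minimal sketch: `install_kaggle_token` is a hypothetical helper name, and the demo runs against a temporary directory with a placeholder token rather than a real home directory or real credentials.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, home_dir):
    """Move kaggle.json into <home_dir>/.kaggle and restrict it to the owner."""
    kaggle_dir = Path(home_dir) / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # like: mkdir -p ~/.kaggle
    target = kaggle_dir / "kaggle.json"
    shutil.move(str(token_path), str(target))       # like: mv kaggle.json ~/.kaggle/
    os.chmod(target, 0o600)                         # like: chmod 600
    return target


# Demo with a placeholder token inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    fake_token = Path(tmp) / "kaggle.json"
    fake_token.write_text('{"username": "example", "key": "placeholder"}')
    fake_home = Path(tmp) / "home"

    installed = install_kaggle_token(fake_token, fake_home)
    installed_exists = installed.exists()
    source_moved = not fake_token.exists()
    mode = stat.S_IMODE(installed.stat().st_mode)
    print(installed_exists, source_moved, oct(mode))
```

In real use, `home_dir` would be `Path.home()` and `token_path` the uploaded kaggle.json.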

!kaggle competitions list

Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.6 / client 1.5.4)
ref                                            deadline             category            reward  teamCount  userHasEntered  
---------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
tpu-getting-started                            2030-06-03 23:59:00  Getting Started      Kudos        125           False  
digit-recognizer                               2030-01-01 00:00:00  Getting Started  Knowledge       2958           False  
titanic                                        2030-01-01 00:00:00  Getting Started  Knowledge      22881            True  
house-prices-advanced-regression-techniques    2030-01-01 00:00:00  Getting Started  Knowledge       4985            True  
connectx                                       2030-01-01 00:00:00  Getting Started  Knowledge        673           False  
nlp-getting-started                            2030-01-01 00:00:00  Getting Started      Kudos       1455            True  
competitive-data-science-predict-future-sales  2020-12-31 23:59:00  Playground           Kudos       7626           False  
halite                                         2020-09-15 23:59:00  Featured              Swag        534           False  
birdsong-recognition                           2020-09-15 23:59:00  Research           $25,000        244           False  
landmark-retrieval-2020                        2020-08-17 23:59:00  Research           $25,000         53           False  
siim-isic-melanoma-classification              2020-08-17 23:59:00  Featured           $30,000       1672           False  
global-wheat-detection                         2020-08-04 23:59:00  Research           $15,000       1353           False  
open-images-object-detection-rvc-2020          2020-07-31 16:00:00  Playground       Knowledge         45           False  
open-images-instance-segmentation-rvc-2020     2020-07-31 16:00:00  Playground       Knowledge          9           False  
hashcode-photo-slideshow                       2020-07-27 23:59:00  Playground       Knowledge         50           False  
prostate-cancer-grade-assessment               2020-07-22 23:59:00  Featured           $25,000        765           False  
alaska2-image-steganalysis                     2020-07-20 23:59:00  Research           $25,000        869           False  
m5-forecasting-accuracy                        2020-06-30 23:59:00  Featured           $50,000       5558            True  
m5-forecasting-uncertainty                     2020-06-30 23:59:00  Featured           $50,000        909           False  
trends-assessment-prediction                   2020-06-29 23:59:00  Research           $25,000       1047           False

Download the data from Kaggle.

!kaggle competitions download -c titanic

Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.6 / client 1.5.4)
gender_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
train.csv: Skipping, found more recently modified local copy (use --force to force download)

!ls

chloevan_key.pem  gender_submission.csv  sample_data  test.csv	train.csv

Now load the data with Pandas.

import pandas as pd

titanic_df = pd.read_csv(r'train.csv')
titanic_df.head(3)

print('type of titanic_df:', type(titanic_df))

type of titanic_df: <class 'pandas.core.frame.DataFrame'>

II. Key Functions for Data Handling

This chapter introduces a few key functions for data handling.

(1) value_counts()

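value_counts() returns the distinct values found in a column together with the number of rows for each value. The result is a Series sorted by count, and missing values (NaN) are excluded by default.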

val_count = titanic_df['Embarked'].value_counts()
print(type(val_count))
print(val_count)

<class 'pandas.core.series.Series'>
S    644
C    168
Q     77
Name: Embarked, dtype: int64
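
Because missing values are left out by default, the counts may not add up to the number of rows. The sketch below uses a small made-up Series standing in for the 'Embarked' column (hypothetical values, not the real Titanic data) to show the `dropna` and `normalize` parameters.

```python
import pandas as pd

# Hypothetical miniature of an 'Embarked'-like column with one missing value.
embarked = pd.Series(["S", "C", "S", None, "Q", "S"], name="Embarked")

counts = embarked.value_counts()                        # NaN excluded by default
counts_with_nan = embarked.value_counts(dropna=False)   # NaN counted as its own entry
shares = embarked.value_counts(normalize=True)          # proportions instead of counts

print(counts)
print(counts_with_nan)
print(shares)
```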

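(2) Dropping Part of a DataFrame

drop() removes rows or columns, depending on axis.
Its main parameters are labels, inplace, and axis.
- labels: the column name(s) or row index label(s) to drop
- inplace: whether to modify the original DataFrame directly (default False, which returns a new DataFrame)
- axis: 0 drops rows, 1 drops columns
First, drop a column using axis=1.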

data = titanic_df.copy()
data_drop = data.drop(labels = 'Age', axis=1)
data_drop.head()
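
The other parameter combinations can be sketched the same way. The example below uses a small made-up DataFrame (hypothetical values, so it runs without train.csv) to drop a row with axis=0, drop several columns at once, and apply inplace=True to a copy.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Age": [22, 38, 26], "Fare": [7.25, 71.28, 7.92]}
)

# axis=0 drops rows by index label; df itself is left unchanged.
without_first_row = df.drop(labels=0, axis=0)

# A list of labels drops several columns in one call.
name_only = df.drop(labels=["Age", "Fare"], axis=1)

# inplace=True modifies the object it is called on, so work on a copy here.
df_copy = df.copy()
df_copy.drop(labels="Age", axis=1, inplace=True)

print(without_first_row)
print(name_only)
print(df_copy)
```

Leaving inplace at its default and assigning the result to a new name, as the post does with data_drop, keeps the original data available for later steps.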
