Kaggle-Python-Bigquery 연동 예제

Page content

1줄 요약

  • 캐글 데이터를 빅쿼리에 넣어보

캐글 데이터 다운로드

  • 캐글 데이터를 다운로드 받습니다.
!pip install kaggle
Requirement already satisfied: kaggle in /usr/local/lib/python3.7/dist-packages (1.5.12)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.15.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from kaggle) (2020.12.5)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.8.1)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from kaggle) (4.41.1)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/dist-packages (from kaggle) (4.0.1)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify->kaggle) (1.3)
!mkdir ~/.kaggle
!echo '{"username":"your_id","key":"your_key"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c tabular-playground-series-apr-2021
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.12 / client 1.5.4)
Downloading test.csv.zip to /content
  0% 0.00/2.07M [00:00<?, ?B/s]
100% 2.07M/2.07M [00:00<00:00, 59.0MB/s]
Downloading train.csv.zip to /content
  0% 0.00/2.13M [00:00<?, ?B/s]
100% 2.13M/2.13M [00:00<00:00, 69.3MB/s]
Downloading sample_submission.csv to /content
  0% 0.00/879k [00:00<?, ?B/s]
100% 879k/879k [00:00<00:00, 124MB/s]
!ls
sample_data  sample_submission.csv  test.csv.zip  train.csv.zip
!unzip "*.zip"
Archive:  train.csv.zip
  inflating: train.csv               

Archive:  test.csv.zip
  inflating: test.csv                

2 archives were successfully processed.

사용자 계정 인증

from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated

빅쿼리 사용 예제

  • 빅쿼리 사용에 앞서서 세팅을 해야 합니다.

  • 가입을 진행한 이후에는 새로운 프로젝트를 만듭니다.

%%bigquery --project hkitadas
SELECT 
  COUNT(*) as total_rows
FROM `bigquery-public-data.samples.gsod`
total_rows
0 114420316
!ls
adc.json     sample_submission.csv  test.csv.zip  train.csv.zip
sample_data  test.csv		    train.csv

빅쿼리 파이썬 연동

  • Google Colab에서 빅쿼리를 사용할 수 있도록 라이브러리를 설치한다.
!pip install --upgrade pandas-gbq 'google-cloud-bigquery[bqstorage,pandas]'
Collecting pandas-gbq
  Downloading https://files.pythonhosted.org/packages/33/bd/7178c52d1c19d577bc972c80d2486541631794a5ed8c8f5178a26d61c1be/pandas_gbq-0.15.0-py3-none-any.whl
Collecting google-cloud-bigquery[bqstorage,pandas]
[?25l  Downloading https://files.pythonhosted.org/packages/f7/c1/510fbde69f8ca9d0253326230fe1c36819f94099532312f1174e870f1112/google_cloud_bigquery-2.13.1-py2.py3-none-any.whl (216kB)
     |████████████████████████████████| 225kB 4.6MB/s 
[?25hRequirement already satisfied, skipping upgrade: setuptools in /usr/local/lib/python3.7/dist-packages (from pandas-gbq) (54.2.0)
Requirement already satisfied, skipping upgrade: google-auth-oauthlib in /usr/local/lib/python3.7/dist-packages (from pandas-gbq) (0.4.4)
Requirement already satisfied, skipping upgrade: pydata-google-auth in /usr/local/lib/python3.7/dist-packages (from pandas-gbq) (1.1.0)
Requirement already satisfied, skipping upgrade: pandas>=0.23.2 in /usr/local/lib/python3.7/dist-packages (from pandas-gbq) (1.1.5)
Requirement already satisfied, skipping upgrade: google-auth in /usr/local/lib/python3.7/dist-packages (from pandas-gbq) (1.28.1)
Requirement already satisfied, skipping upgrade: packaging>=14.3 in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery[bqstorage,pandas]) (20.9)
Collecting proto-plus>=1.10.0
[?25l  Downloading https://files.pythonhosted.org/packages/b4/8a/61c5a9b9b6288f9b060b6e3d88374fc083953a29aeac7206616c2d3c9c8e/proto_plus-1.18.1-py3-none-any.whl (42kB)
     |████████████████████████████████| 51kB 4.4MB/s 
[?25hCollecting google-cloud-core<2.0dev,>=1.4.1
  Downloading https://files.pythonhosted.org/packages/ad/fc/6e8c449185cb8862af353c1164100ff75e32d55ba1de3baf9eaa01b7d2a9/google_cloud_core-1.6.0-py2.py3-none-any.whl
Requirement already satisfied, skipping upgrade: requests<3.0.0dev,>=2.18.0 in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.23.0)
Requirement already satisfied, skipping upgrade: protobuf>=3.12.0 in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery[bqstorage,pandas]) (3.12.4)
Requirement already satisfied, skipping upgrade: google-api-core[grpc]<2.0.0dev,>=1.23.0 in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery[bqstorage,pandas]) (1.26.3)
Collecting google-resumable-media<2.0dev,>=0.6.0
[?25l  Downloading https://files.pythonhosted.org/packages/22/52/4b086e0d15245b648e2b6d408fb2f2974a3a5a405de5d7fae25cd085a3fa/google_resumable_media-1.2.0-py2.py3-none-any.whl (75kB)
     |████████████████████████████████| 81kB 5.5MB/s 
[?25hCollecting google-cloud-bigquery-storage<3.0.0dev,>=2.0.0; extra == "bqstorage"
[?25l  Downloading https://files.pythonhosted.org/packages/8d/de/7d88e746afade62857f4ebf4adf8048bddfcaceb8e30cc0746d54ee12794/google_cloud_bigquery_storage-2.4.0-py2.py3-none-any.whl (143kB)
     |████████████████████████████████| 153kB 5.8MB/s 
[?25hRequirement already satisfied, skipping upgrade: grpcio<2.0dev,>=1.32.0; extra == "bqstorage" in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery[bqstorage,pandas]) (1.32.0)
Requirement already satisfied, skipping upgrade: pyarrow<4.0dev,>=1.0.0; extra == "bqstorage" in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery[bqstorage,pandas]) (3.0.0)
Requirement already satisfied, skipping upgrade: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.7/dist-packages (from google-auth-oauthlib->pandas-gbq) (1.3.0)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.23.2->pandas-gbq) (2018.9)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.23.2->pandas-gbq) (2.8.1)
Requirement already satisfied, skipping upgrade: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.23.2->pandas-gbq) (1.19.5)
Requirement already satisfied, skipping upgrade: rsa<5,>=3.1.4; python_version >= "3.6" in /usr/local/lib/python3.7/dist-packages (from google-auth->pandas-gbq) (4.7.2)
Requirement already satisfied, skipping upgrade: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from google-auth->pandas-gbq) (4.2.1)
Requirement already satisfied, skipping upgrade: six>=1.9.0 in /usr/local/lib/python3.7/dist-packages (from google-auth->pandas-gbq) (1.15.0)
Requirement already satisfied, skipping upgrade: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.7/dist-packages (from google-auth->pandas-gbq) (0.2.8)
Requirement already satisfied, skipping upgrade: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=14.3->google-cloud-bigquery[bqstorage,pandas]) (2.4.7)
Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0dev,>=2.18.0->google-cloud-bigquery[bqstorage,pandas]) (2.10)
Requirement already satisfied, skipping upgrade: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0dev,>=2.18.0->google-cloud-bigquery[bqstorage,pandas]) (3.0.4)
Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0dev,>=2.18.0->google-cloud-bigquery[bqstorage,pandas]) (1.24.3)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0dev,>=2.18.0->google-cloud-bigquery[bqstorage,pandas]) (2020.12.5)
Requirement already satisfied, skipping upgrade: googleapis-common-protos<2.0dev,>=1.6.0 in /usr/local/lib/python3.7/dist-packages (from google-api-core[grpc]<2.0.0dev,>=1.23.0->google-cloud-bigquery[bqstorage,pandas]) (1.53.0)
Collecting google-crc32c<2.0dev,>=1.0; python_version >= "3.5"
  Downloading https://files.pythonhosted.org/packages/fc/ae/b6efa1019e18c6c791f0f5cd93b2ff40f8f06696dbf04db39ec0f5591b1e/google_crc32c-1.1.2-cp37-cp37m-manylinux2014_x86_64.whl
Collecting libcst>=0.2.5
[?25l  Downloading https://files.pythonhosted.org/packages/c3/7f/4aa1419b0ecb8a31a79fef7a79b49e6a07b977baa6c94612aeeda0228d17/libcst-0.3.18-py3-none-any.whl (512kB)
     |████████████████████████████████| 522kB 7.9MB/s 
[?25hRequirement already satisfied, skipping upgrade: oauthlib>=3.0.0 in /usr/local/lib/python3.7/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib->pandas-gbq) (3.1.0)
Requirement already satisfied, skipping upgrade: pyasn1>=0.1.3 in /usr/local/lib/python3.7/dist-packages (from rsa<5,>=3.1.4; python_version >= "3.6"->google-auth->pandas-gbq) (0.4.8)
Requirement already satisfied, skipping upgrade: cffi>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from google-crc32c<2.0dev,>=1.0; python_version >= "3.5"->google-resumable-media<2.0dev,>=0.6.0->google-cloud-bigquery[bqstorage,pandas]) (1.14.5)
Requirement already satisfied, skipping upgrade: typing-extensions>=3.7.4.2 in /usr/local/lib/python3.7/dist-packages (from libcst>=0.2.5->google-cloud-bigquery-storage<3.0.0dev,>=2.0.0; extra == "bqstorage"->google-cloud-bigquery[bqstorage,pandas]) (3.7.4.3)
Collecting typing-inspect>=0.4.0
  Downloading https://files.pythonhosted.org/packages/42/1c/66402db44184904a2f14722d317a4da0b5c8c78acfc3faf74362566635c5/typing_inspect-0.6.0-py3-none-any.whl
Collecting pyyaml>=5.2
[?25l  Downloading https://files.pythonhosted.org/packages/7a/a5/393c087efdc78091afa2af9f1378762f9821c9c1d7a22c5753fb5ac5f97a/PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636kB)
     |████████████████████████████████| 645kB 10.5MB/s 
[?25hRequirement already satisfied, skipping upgrade: pycparser in /usr/local/lib/python3.7/dist-packages (from cffi>=1.0.0->google-crc32c<2.0dev,>=1.0; python_version >= "3.5"->google-resumable-media<2.0dev,>=0.6.0->google-cloud-bigquery[bqstorage,pandas]) (2.20)
Collecting mypy-extensions>=0.3.0
  Downloading https://files.pythonhosted.org/packages/5c/eb/975c7c080f3223a5cdaff09612f3a5221e4ba534f7039db34c35d95fa6a5/mypy_extensions-0.4.3-py2.py3-none-any.whl
ERROR: google-cloud-storage 1.18.1 has requirement google-resumable-media<0.5.0dev,>=0.3.1, but you'll have google-resumable-media 1.2.0 which is incompatible.
Installing collected packages: proto-plus, google-cloud-core, google-crc32c, google-resumable-media, mypy-extensions, typing-inspect, pyyaml, libcst, google-cloud-bigquery-storage, google-cloud-bigquery, pandas-gbq
  Found existing installation: google-cloud-core 1.0.3
    Uninstalling google-cloud-core-1.0.3:
      Successfully uninstalled google-cloud-core-1.0.3
  Found existing installation: google-resumable-media 0.4.1
    Uninstalling google-resumable-media-0.4.1:
      Successfully uninstalled google-resumable-media-0.4.1
  Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
  Found existing installation: google-cloud-bigquery-storage 1.1.0
    Uninstalling google-cloud-bigquery-storage-1.1.0:
      Successfully uninstalled google-cloud-bigquery-storage-1.1.0
  Found existing installation: google-cloud-bigquery 1.21.0
    Uninstalling google-cloud-bigquery-1.21.0:
      Successfully uninstalled google-cloud-bigquery-1.21.0
  Found existing installation: pandas-gbq 0.13.3
    Uninstalling pandas-gbq-0.13.3:
      Successfully uninstalled pandas-gbq-0.13.3
Successfully installed google-cloud-bigquery-2.13.1 google-cloud-bigquery-storage-2.4.0 google-cloud-core-1.6.0 google-crc32c-1.1.2 google-resumable-media-1.2.0 libcst-0.3.18 mypy-extensions-0.4.3 pandas-gbq-0.15.0 proto-plus-1.18.1 pyyaml-5.4.1 typing-inspect-0.6.0
  • 설치한 이후에는 간단하게 테스트를 진행한다.
from google.cloud import bigquery
from tabulate import tabulate
import pandas

project_id = 'hkitadas'
sql = """
  SELECT 
    COUNT(*) as total_rows
  FROM [bigquery-public-data.samples.gsod]
"""

df = pandas.read_gbq(sql, project_id=project_id, dialect='legacy')
df.head()
Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=725825577420-unm2gnkiprugilg743tkbig250f4sfsj.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fbigquery&state=sl3ecik54k9vvCgXT9IT9O8695hH9H&prompt=consent&access_type=offline
Enter the authorization code: 4/1AY0e-g4x2y52M9InlPexuKjm2F8zNyaixF6Kmm4c4uTpqaH8PuDQm4kSrEc
total_rows
0 114420316

데이터 적재하기

  • 데이터를 불러옵니다.
import pandas as pd

BASE_DIR = "./"
train = pd.read_csv(BASE_DIR + 'train.csv')
test = pd.read_csv(BASE_DIR + 'test.csv')

train.shape, test.shape
((100000, 12), (100000, 11))
  • 데이터를 적재하기 앞서서 테이블세트를 만듭니다.

  • 처음에는 train 데이터와 test 데이터가 존재하지 않습니다.
train_table_id = 'titanic.train'
train.to_gbq(train_table_id)

test_table_id = 'titanic.test'
test.to_gbq(test_table_id)
1it [00:07,  7.22s/it]
1it [00:05,  5.88s/it]
  • 이제 마지막으로 확인해봅니다.

데이터 불러오기

  • 이제 마지막으로 빅쿼리에서 데이터를 불러옵니다.
project_id = 'hkitadas'
train_sql = """
  SELECT 
    *
  FROM [hkitadas.titanic.train]
"""

test_sql = """
  SELECT 
    *
  FROM [hkitadas.titanic.test]
"""

train = pandas.read_gbq(train_sql, project_id=project_id, dialect='legacy')
test = pandas.read_gbq(test_sql, project_id=project_id, dialect='legacy')

train.shape, test.shape
((100000, 12), (100000, 11))

References