Kaggle Survey Data Transformation Tip
Page content
Intro
- Data Transformation is always important to visualise.
- Here, I just introduced to get value counts in different dataset.
- If you are newbie, please be aware of this code before you dive into visualization.
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv
/kaggle/input/kaggle-survey-2021/supplementary_data/kaggle_survey_2021_methodology.pdf
/kaggle/input/kaggle-survey-2021/supplementary_data/kaggle_survey_2021_answer_choices.pdf
Data Import
- Import raw data and split into questions dataset and survey dataset.
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
questions = df.iloc[0, :].T
questions
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3441: DtypeWarning: Columns (0,195,201,285,286,287,288,289,290,291,292) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
Time from Start to Finish (seconds) Duration (in seconds)
Q1 What is your age (# years)?
Q2 What is your gender? - Selected Choice
Q3 In which country do you currently reside?
Q4 What is the highest level of formal education ...
...
Q38_B_Part_8 In the next 2 years, do you hope to become mor...
Q38_B_Part_9 In the next 2 years, do you hope to become mor...
Q38_B_Part_10 In the next 2 years, do you hope to become mor...
Q38_B_Part_11 In the next 2 years, do you hope to become mor...
Q38_B_OTHER In the next 2 years, do you hope to become mor...
Name: 0, Length: 369, dtype: object
df = df.iloc[1:, :]
Quick Data Review
- All survey responses are count-based dataset.
- It’s easy to check using value counts()
df['Q1'].value_counts()
25-29 4931
18-21 4901
22-24 4694
30-34 3441
35-39 2504
40-44 1890
45-49 1375
50-54 964
55-59 592
60-69 553
70+ 128
Name: Q1, dtype: int64
Problem
- Some questions are not easy to counts because of Supplementary Questions.
questions.index.tolist()[7:20]
['Q7_Part_1',
'Q7_Part_2',
'Q7_Part_3',
'Q7_Part_4',
'Q7_Part_5',
'Q7_Part_6',
'Q7_Part_7',
'Q7_Part_8',
'Q7_Part_9',
'Q7_Part_10',
'Q7_Part_11',
'Q7_Part_12',
'Q7_OTHER']
- For this we need another way to combine into one dataset.
- Many Questions are very similar like Q7.
- Let’s Create function.
- Main Reference is here: https://www.kaggle.com/ruchi798/kaggle-ml-ds-survey-analysis
- Just add some if_condition.
def sub_questions_count(question_num, part_num, text = False):
part_questions = []
if text in ["A", "B"]:
part_questions = ['Q' + str(question_num) + "_" + text + '_Part_' + str(j) for j in range(1, part_num)]
part_questions.append('Q' + str(question_num) + "_" + text + '_OTHER')
else:
part_questions = ['Q' + str(question_num) + '_Part_' + str(j) for j in range(1, part_num)]
part_questions.append('Q' + str(question_num) + '_OTHER')
# category count
categories = []
counts = []
for i in part_questions:
category = df[i].value_counts().index[0]
val = df[i].value_counts()[0]
categories.append(category)
counts.append(val)
combined_df = pd.DataFrame()
combined_df['Category'] = categories
combined_df['Count'] = counts
combined_df = combined_df.sort_values(['Count'], ascending = False)
return combined_df
Test
- Case 1
# Test
# 'Q38_B_Part_11',
print(sub_questions_count(38, 11, "B").reset_index(drop=True))
Category Count
0 TensorBoard 4239
1 MLflow 2747
2 Weights & Biases 1583
3 Neptune.ai 1276
4 ClearML 1020
5 Polyaxon 737
6 Guild.ai 729
7 Domino Model Monitor 666
8 Comet.ml 633
9 Sacred + Omniboard 591
10 Other 377
Case 2.
# 'Q36_A_Part_7',
print(sub_questions_count(36, 7, "A").reset_index(drop=True))
Category Count
0 Automated model selection (e.g. auto-sklearn, ... 1186
1 Automated data augmentation (e.g. imgaug, albu... 1009
2 Automated hyperparameter tuning (e.g. hyperopt... 952
3 Automation of full ML pipelines (e.g. Google A... 893
4 Automated feature engineering/selection (e.g. ... 781
5 Automated model architecture searches (e.g. da... 344
6 Other 214
# 'Q39_Part_9',
print(sub_questions_count(39, 9).reset_index(drop=True))
Category Count
0 GitHub 4586
1 Kaggle 3065
2 Colab 1848
3 Personal blog 705
4 Streamlit 387
5 Other 306
6 Shiny 305
7 Plotly Dash 293
8 NBViewer 136
강의 홍보
- 취준생을 위한 강의를 제작하였습니다.
- 본 블로그를 통해서 강의를 수강하신 분은 게시글 제목과 링크를 수강하여 인프런 메시지를 통해 보내주시기를 바랍니다.
스타벅스 아이스 아메리카노를 선물
로 보내드리겠습니다.
- [비전공자 대환영] 제로베이스도 쉽게 입문하는 파이썬 데이터 분석 - 캐글입문기