Kaggle Survey Data Transformation Tip

2021-11-18

Page content

Intro

Data Transformation is always important to visualise.
Here, I just introduced to get value counts in different dataset.
If you are newbie, please be aware of this code before you dive into visualization.

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv
/kaggle/input/kaggle-survey-2021/supplementary_data/kaggle_survey_2021_methodology.pdf
/kaggle/input/kaggle-survey-2021/supplementary_data/kaggle_survey_2021_answer_choices.pdf

Data Import

Import raw data and split into questions dataset and survey dataset.

df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
questions = df.iloc[0, :].T
questions

/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3441: DtypeWarning: Columns (0,195,201,285,286,287,288,289,290,291,292) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)





Time from Start to Finish (seconds)                                Duration (in seconds)
Q1                                                           What is your age (# years)?
Q2                                                What is your gender? - Selected Choice
Q3                                             In which country do you currently reside?
Q4                                     What is the highest level of formal education ...
                                                             ...                        
Q38_B_Part_8                           In the next 2 years, do you hope to become mor...
Q38_B_Part_9                           In the next 2 years, do you hope to become mor...
Q38_B_Part_10                          In the next 2 years, do you hope to become mor...
Q38_B_Part_11                          In the next 2 years, do you hope to become mor...
Q38_B_OTHER                            In the next 2 years, do you hope to become mor...
Name: 0, Length: 369, dtype: object

df = df.iloc[1:, :]

Quick Data Review

All survey responses are count-based dataset.
It’s easy to check using value counts()

df['Q1'].value_counts()

25-29    4931
18-21    4901
22-24    4694
30-34    3441
35-39    2504
40-44    1890
45-49    1375
50-54     964
55-59     592
60-69     553
70+       128
Name: Q1, dtype: int64

Problem

Some questions are not easy to counts because of Supplementary Questions.

questions.index.tolist()[7:20]

['Q7_Part_1',
 'Q7_Part_2',
 'Q7_Part_3',
 'Q7_Part_4',
 'Q7_Part_5',
 'Q7_Part_6',
 'Q7_Part_7',
 'Q7_Part_8',
 'Q7_Part_9',
 'Q7_Part_10',
 'Q7_Part_11',
 'Q7_Part_12',
 'Q7_OTHER']

For this we need another way to combine into one dataset.
Many Questions are very similar like Q7.
Let’s Create function.
Main Reference is here: https://www.kaggle.com/ruchi798/kaggle-ml-ds-survey-analysis
Just add some if_condition.

def sub_questions_count(question_num, part_num, text = False):
  part_questions = []

  if text in ["A", "B"]:
    part_questions = ['Q' + str(question_num) + "_" + text + '_Part_' + str(j) for j in range(1, part_num)]
    part_questions.append('Q' + str(question_num) + "_" + text + '_OTHER')
  else:
    part_questions = ['Q' + str(question_num) + '_Part_' + str(j) for j in range(1, part_num)]
    part_questions.append('Q' + str(question_num) + '_OTHER')

  # category count
  categories = []
  counts = []
  for i in part_questions:
    category = df[i].value_counts().index[0]
    val = df[i].value_counts()[0]
    categories.append(category)
    counts.append(val)

  combined_df = pd.DataFrame()
  combined_df['Category'] = categories
  combined_df['Count'] = counts

  combined_df = combined_df.sort_values(['Count'], ascending = False)
  return combined_df

Test

Case 1

# Test 
# 'Q38_B_Part_11',
print(sub_questions_count(38, 11, "B").reset_index(drop=True))

                  Category  Count
0             TensorBoard    4239
1                  MLflow    2747
2        Weights & Biases    1583
3              Neptune.ai    1276
4                 ClearML    1020
5                Polyaxon     737
6                Guild.ai     729
7    Domino Model Monitor     666
8                Comet.ml     633
9      Sacred + Omniboard     591
10                   Other    377

Case 2.

# 'Q36_A_Part_7',
print(sub_questions_count(36, 7, "A").reset_index(drop=True))

                                            Category  Count
0  Automated model selection (e.g. auto-sklearn, ...   1186
1  Automated data augmentation (e.g. imgaug, albu...   1009
2  Automated hyperparameter tuning (e.g. hyperopt...    952
3  Automation of full ML pipelines (e.g. Google A...    893
4  Automated feature engineering/selection (e.g. ...    781
5  Automated model architecture searches (e.g. da...    344
6                                              Other    214

# 'Q39_Part_9',
print(sub_questions_count(39, 9).reset_index(drop=True))

          Category  Count
0          GitHub    4586
1          Kaggle    3065
2           Colab    1848
3   Personal blog     705
4       Streamlit     387
5            Other    306
6           Shiny     305
7     Plotly Dash     293
8        NBViewer     136

강의 홍보

취준생을 위한 강의를 제작하였습니다.
본 블로그를 통해서 강의를 수강하신 분은 게시글 제목과 링크를 수강하여 인프런 메시지를 통해 보내주시기를 바랍니다.
- 스타벅스 아이스 아메리카노를 선물로 보내드리겠습니다.
[비전공자 대환영] 제로베이스도 쉽게 입문하는 파이썬 데이터 분석 - 캐글입문기