  • Data transformation is an important step before any visualisation.
  • Here I only show how to get value counts from the different datasets.
  • If you are a newcomer, make sure you understand this code before you dive into visualization.
Data Import

  • Import the raw data and split it into a questions dataset and a survey dataset.
import pandas as pd

df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
# The first row holds the question text for each column.
questions = df.iloc[0, :].T
questions
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3441: DtypeWarning: Columns (0,195,201,285,286,287,288,289,290,291,292) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)

Time from Start to Finish (seconds)                                Duration (in seconds)
Q1                                                           What is your age (# years)?
Q2                                                What is your gender? - Selected Choice
Q3                                             In which country do you currently reside?
Q4                                     What is the highest level of formal education ...
                                                             ...
Q38_B_Part_8                           In the next 2 years, do you hope to become mor...
Q38_B_Part_9                           In the next 2 years, do you hope to become mor...
Q38_B_Part_10                          In the next 2 years, do you hope to become mor...
Q38_B_Part_11                          In the next 2 years, do you hope to become mor...
Q38_B_OTHER                            In the next 2 years, do you hope to become mor...
Name: 0, Length: 369, dtype: object
# Drop the question-text row so that only survey responses remain.
df = df.iloc[1:, :]
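The split above can be tried on a small in-memory CSV. This is only a sketch: the two-column sample and the names `raw`, `sample`, `sample_questions` and `sample_answers` are made up for illustration. It also passes `low_memory=False`, which is one of the two options the DtypeWarning itself suggests.

```python
import io

import pandas as pd

# Hypothetical two-column stand-in for the survey CSV:
# the first data row holds the question text, the rest are responses.
raw = io.StringIO(
    "Q1,Q2\n"
    "What is your age (# years)?,What is your gender? - Selected Choice\n"
    "25-29,Man\n"
    "18-21,Woman\n"
)

# low_memory=False is one of the options named in the DtypeWarning
# for columns with mixed types.
sample = pd.read_csv(raw, low_memory=False)

# Question text, indexed by column name.
sample_questions = sample.iloc[0, :]

# Responses only, re-indexed from 0.
sample_answers = sample.iloc[1:, :].reset_index(drop=True)

print(sample_questions)
print(sample_answers)
```

Resetting the index is optional; the notebook keeps the original index, which then starts at 1.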

Quick Data Review

  • All survey responses are categorical, so the dataset is count-based.
  • It’s easy to check with value_counts(), for example df['Q1'].value_counts():
25-29    4931
18-21    4901
22-24    4694
30-34    3441
35-39    2504
40-44    1890
45-49    1375
50-54     964
55-59     592
60-69     553
70+       128
Name: Q1, dtype: int64
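For a single-choice question like Q1, the count comes from one call to `value_counts()`. The snippet below is a minimal sketch on a made-up Series (the `ages` data is hypothetical, not taken from the survey). It also shows `normalize=True`, which returns shares instead of raw counts.

```python
import pandas as pd

# Hypothetical responses to a single-choice question.
ages = pd.Series(
    ["25-29", "18-21", "25-29", "30-34", "25-29", "18-21"],
    name="Q1",
)

# Counts per category, sorted from most to least frequent.
counts = ages.value_counts()

# Same categories, expressed as fractions of all responses.
shares = ages.value_counts(normalize=True)

print(counts)
print(shares)
```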


  • Some questions are harder to count because their answers are spread across several sub-question columns (Part_1, Part_2, ..., OTHER).
def sub_questions_count(question_num, part_num, text = False):
  part_questions = []

  # Build the column names. range(1, part_num) stops at part_num - 1,
  # so the final Part_<part_num> column is not included.
  if text in ["A", "B"]:
    part_questions = ['Q' + str(question_num) + "_" + text + '_Part_' + str(j) for j in range(1, part_num)]
    part_questions.append('Q' + str(question_num) + "_" + text + '_OTHER')
  else:
    part_questions = ['Q' + str(question_num) + '_Part_' + str(j) for j in range(1, part_num)]
    part_questions.append('Q' + str(question_num) + '_OTHER')

  # category count: each column is expected to hold a single answer label,
  # so the first value_counts() entry gives that label and its count
  categories = []
  counts = []
  for i in part_questions:
    value_counts = df[i].value_counts()
    categories.append(value_counts.index[0])
    counts.append(value_counts.iloc[0])

  combined_df = pd.DataFrame()
  combined_df['Category'] = categories
  combined_df['Count'] = counts

  combined_df = combined_df.sort_values(['Count'], ascending = False)
  return combined_df
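To see why taking the first `value_counts()` entry per column works, here is a small sketch on a made-up multi-select question. The question number `Q99`, the `toy` frame and its values are hypothetical. The assumption is that each option has its own column, holding the option label when it was selected and NaN otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-select question "Q99": one column per option.
toy = pd.DataFrame({
    "Q99_Part_1": ["GitHub", np.nan, "GitHub", "GitHub"],
    "Q99_Part_2": [np.nan, "Kaggle", "Kaggle", np.nan],
    "Q99_OTHER": [np.nan, np.nan, "Other", np.nan],
})

rows = []
for col in toy.columns:
    # NaN is ignored by value_counts(), so each column yields one entry:
    # the option label and how many respondents selected it.
    vc = toy[col].value_counts()
    rows.append({"Category": vc.index[0], "Count": int(vc.iloc[0])})

toy_counts = (
    pd.DataFrame(rows)
    .sort_values("Count", ascending=False)
    .reset_index(drop=True)
)
print(toy_counts)
```

Collecting the rows in a list and building the DataFrame once is just a stylistic choice; it gives the same Category/Count table as filling two separate lists.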


  • Case 1
# Test 
# 'Q38_B_Part_11',
print(sub_questions_count(38, 11, "B").reset_index(drop=True))
                  Category  Count
0             TensorBoard    4239
1                  MLflow    2747
2        Weights & Biases    1583
3              Neptune.ai    1276
4                 ClearML    1020
5                Polyaxon     737
6                Guild.ai     729
7    Domino Model Monitor     666
8                Comet.ml     633
9      Sacred + Omniboard     591
10                   Other    377

  • Case 2

# 'Q36_A_Part_7',
print(sub_questions_count(36, 7, "A").reset_index(drop=True))
                                            Category  Count
0  Automated model selection (e.g. auto-sklearn, ...   1186
1  Automated data augmentation (e.g. imgaug, albu...   1009
2  Automated hyperparameter tuning (e.g. hyperopt...    952
3  Automation of full ML pipelines (e.g. Google A...    893
4  Automated feature engineering/selection (e.g. ...    781
5  Automated model architecture searches (e.g. da...    344
6                                              Other    214
# 'Q39_Part_9',
print(sub_questions_count(39, 9).reset_index(drop=True))
          Category  Count
0          GitHub    4586
1          Kaggle    3065
2           Colab    1848
3   Personal blog     705
4       Streamlit     387
5            Other    306
6           Shiny     305
7     Plotly Dash     293
8        NBViewer     136

