Tutorial of Ranzcr EDA

2021-02-11

Python, Kaggle, Ranzcr

Page content

강의 홍보

취준생을 위한 강의를 제작하였습니다.
본 블로그를 통해서 강의를 수강하신 분은 게시글 제목과 링크를 수강하여 인프런 메시지를 통해 보내주시기를 바랍니다.
- 스타벅스 아이스 아메리카노를 선물로 보내드리겠습니다.
[비전공자 대환영] 제로베이스도 쉽게 입문하는 파이썬 데이터 분석 - 캐글입문기

Competition

https://www.kaggle.com/c/ranzcr-clip-catheter-line-classification

Intro

Thanks to RANZCR/resnext50_32x4d starter [training]
- Please visit here and upvote

import os

import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

Check File Size

Check Each Size of Dataset Folder in this competition
- train_records = 4.5GB
- test_tfrecords = 0.5MB
- train (image data) = 6.5GB
- test (image data) = 0.8MB

import os

def get_folder_size(file_directory):
  # file_list = os.listdir(file_directory)
  dir_sizes = {}
  for r, d, f in os.walk(file_directory, False):
      size = sum(os.path.getsize(os.path.join(r,f)) for f in f+d)
      size += sum(dir_sizes[os.path.join(r,d)] for d in d)
      dir_sizes[r] = size
      print("{} is {} MB".format(r, round(size/2**20), 2))      
  
base_dir = '../input/ranzcr-clip-catheter-line-classification'
get_folder_size(base_dir)

../input/ranzcr-clip-catheter-line-classification/test is 805 MB
../input/ranzcr-clip-catheter-line-classification/test_tfrecords is 555 MB
../input/ranzcr-clip-catheter-line-classification/train_tfrecords is 4563 MB
../input/ranzcr-clip-catheter-line-classification/train is 6592 MB
../input/ranzcr-clip-catheter-line-classification is 12524 MB

Check train file

Let’s descirbe train

train = pd.read_csv('../input/ranzcr-clip-catheter-line-classification/train.csv', index_col = 0)
test = pd.read_csv('../input/ranzcr-clip-catheter-line-classification/sample_submission.csv', index_col = 0)
display(train.head())
display(test.head())

	ETT - Abnormal	ETT - Borderline	ETT - Normal	NGT - Abnormal	NGT - Borderline	NGT - Incompletely Imaged	NGT - Normal	CVC - Abnormal	CVC - Borderline	CVC - Normal	Swan Ganz Catheter Present	PatientID
StudyInstanceUID
1.2.826.0.1.3680043.8.498.26697628953273228189375557799582420561	0	0	0	0	0	0	1	0	0	0	0	ec89415d1
1.2.826.0.1.3680043.8.498.46302891597398758759818628675365157729	0	0	1	0	0	1	0	0	0	1	0	bf4c6da3c
1.2.826.0.1.3680043.8.498.23819260719748494858948050424870692577	0	0	0	0	0	0	0	0	1	0	0	3fc1c97e5
1.2.826.0.1.3680043.8.498.68286643202323212801283518367144358744	0	0	0	0	0	0	0	1	0	0	0	c31019814
1.2.826.0.1.3680043.8.498.10050203009225938259119000528814762175	0	0	0	0	0	0	0	0	0	1	0	207685cd1

	ETT - Abnormal	ETT - Borderline	ETT - Normal	NGT - Abnormal	NGT - Borderline	NGT - Incompletely Imaged	NGT - Normal	CVC - Abnormal	CVC - Borderline	CVC - Normal	Swan Ganz Catheter Present
StudyInstanceUID
1.2.826.0.1.3680043.8.498.46923145579096002617106567297135160932	0	0	0	0	0	0	0	0	0	0	0
1.2.826.0.1.3680043.8.498.84006870182611080091824109767561564887	0	0	0	0	0	0	0	0	0	0	0
1.2.826.0.1.3680043.8.498.12219033294413119947515494720687541672	0	0	0	0	0	0	0	0	0	0	0
1.2.826.0.1.3680043.8.498.84994474380235968109906845540706092671	0	0	0	0	0	0	0	0	0	0	0
1.2.826.0.1.3680043.8.498.35798987793805669662572108881745201372	0	0	0	0	0	0	0	0	0	0	0

Definitions of Variables

What’s inside data?
- StudyInstanceUID - unique ID for each image
- ETT - Abnormal - endotracheal tube placement abnormal
- ETT - Borderline - endotracheal tube placement borderline abnormal
- ETT - Normal - endotracheal tube placement normal
- NGT - Abnormal - nasogastric tube placement abnormal
- NGT - Borderline - nasogastric tube placement borderline abnormal
- NGT - Incompletely Imaged - nasogastric tube placement inconclusive due to imaging
- NGT - Normal - nasogastric tube placement borderline normal
- CVC - Abnormal - central venous catheter placement abnormal
- CVC - Borderline - central venous catheter placement borderline abnormal
- CVC - Normal - central venous catheter placement normal
- Swan Ganz Catheter Present(??)
- PatientID - unique ID for each patient in the dataset

Data Distribution of Each Variable

why two calculations are different?
- When inserting catheters and lines into patients, some patients needs them to put on multiple positions.
- Let’s see PatientID - bf4c6da3c
But, you realize that three groups - ETT, NGT, CVC counted seperately.

print("Total Rows of Train Data is", len(train))
print("Total Count of Each Variable in Train Data is", train.iloc[:, :-1].sum().sum())

var_cal_tmp = train.iloc[:, :-1].sum()
print(var_cal_tmp)

Total Rows of Train Data is 30083
Total Count of Each Variable in Train Data is 50619
ETT - Abnormal                   79
ETT - Borderline               1138
ETT - Normal                   7240
NGT - Abnormal                  279
NGT - Borderline                529
NGT - Incompletely Imaged      2748
NGT - Normal                   4797
CVC - Abnormal                 3195
CVC - Borderline               8460
CVC - Normal                  21324
Swan Ganz Catheter Present      830
dtype: int64

train.iloc[1].to_frame().T

	ETT - Abnormal	ETT - Borderline	ETT - Normal	NGT - Abnormal	NGT - Borderline	NGT - Incompletely Imaged	NGT - Normal	CVC - Abnormal	CVC - Borderline	CVC - Normal	Swan Ganz Catheter Present	PatientID
1.2.826.0.1.3680043.8.498.46302891597398758759818628675365157729	0	0	1	0	0	1	0	0	0	1	0	bf4c6da3c

Quick Visualization

In general, CVC outnumbered other group.

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x = var_cal_tmp.values, y = var_cal_tmp.index, ax=ax)
ax.tick_params(axis="x", labelsize=14)
ax.tick_params(axis="y", labelsize=14)
ax.set_xlabel("Number of Images", fontsize=15)
ax.set_title("Distribution of Labels", fontsize=15)

Text(0.5, 1.0, 'Distribution of Labels')

png

The number of Patients are smaller than total data.
It means some patients are frequently checked, depending upon patients

print("Number of Unique Patients: ", train["PatientID"].unique().shape[0])
print("Number of Total Data: ", len(train["PatientID"]))

Number of Unique Patients:  3255
Number of Total Data:  30083

tmp = train['PatientID'].value_counts()
print(tmp)
fig, ax = plt.subplots(figsize=(24, 6))
sns.countplot(x = tmp.values, ax=ax)
ax.tick_params(axis="x", labelsize=10)
ax.tick_params(axis="y", labelsize=14)
ax.set_xlabel("Number of Images", fontsize=15)
ax.set_title("Distribution of Labels", fontsize=15)

05029c63a    172
55073fece    167
26da0d5ad    148
8849382d0    130
34242119f    110
            ... 
ad32e88e0      1
7755053cb      1
2d5a5f0d0      1
1951dc11c      1
22e8f333f      1
Name: PatientID, Length: 3255, dtype: int64





Text(0.5, 1.0, 'Distribution of Labels')

png

Now, we need to see the distribution of data in each variable.

target_cols = ['ETT - Abnormal', 'ETT - Borderline', 'ETT - Normal', 'NGT - Abnormal', 
               'NGT - Borderline', 'NGT - Incompletely Imaged', 'NGT - Normal', 'CVC - Abnormal',
               'CVC - Borderline', 'CVC - Normal', 'Swan Ganz Catheter Present']

fig, ax = plt.subplots(4, 3, figsize=(16, 10))
for i, col in enumerate(train[target_cols].columns[0:]):
  print(i, col)
  if i <= 2:
    ax[0, i].hist(train[col].values)
    ax[0, i].set_title(f'target: {col}')
  elif i <= 5:
    ax[1, i-3].hist(train[col].values)
    ax[1, i-3].set_title(f'target: {col}')
  elif i <= 8:
    ax[2, i-6].hist(train[col].values)
    ax[2, i-6].set_title(f'target: {col}')
  else:
    ax[3, i-9].hist(train[col].values)
    ax[3, i-9].set_title(f'target: {col}')

fig.tight_layout()
fig.subplots_adjust(top=0.95)

0 ETT - Abnormal
1 ETT - Borderline
2 ETT - Normal
3 NGT - Abnormal
4 NGT - Borderline
5 NGT - Incompletely Imaged
6 NGT - Normal
7 CVC - Abnormal
8 CVC - Borderline
9 CVC - Normal
10 Swan Ganz Catheter Present

png

How to interpret the graph?
- CVC group is the top most amongst groups
- In each group, Normal is the top most.
This datasets are typically imbalanced, and multi-classification problem is revealed.

Background Knowledge

Since my major is far from this medical area, it difficults to figure what to classify from images.
So, need some videos to understand the processing.
Thanks to RANZCR CLiP: Visualize and Understand Dataset
- Please visit here and upvote

Endotracheal Tube¶

It’s so called ETT in this dataset.

from IPython.display import YouTubeVideo
YouTubeVideo('FtJr7i7ENMY')

Nasogastric Tube

It’s so called NTT in this dataset.

YouTubeVideo('Abf3Gd6AaZQ')

Central venous catheter

It’s so called CVC in this dataset.

YouTubeVideo('mTBrCMn86cU')

Swan Ganz Catheter Present

It’s Swan Ganz Catheter Present

YouTubeVideo('YkN30T6ig30')

Check train annotation file

What’s Inside train_annotations file?
- The main purpose is said that ‘These are segmentation annotations for training samples that have them. They are included solely as additional information for competitors.’
Let’s look at data

annot = pd.read_csv("../input/ranzcr-clip-catheter-line-classification/train_annotations.csv")
annot.head(10)

	StudyInstanceUID	label	data
0	1.2.826.0.1.3680043.8.498.12616281126973421762...	CVC - Normal	[[1487, 1279], [1477, 1168], [1472, 1052], [14...
1	1.2.826.0.1.3680043.8.498.12616281126973421762...	CVC - Normal	[[1328, 7], [1347, 101], [1383, 193], [1400, 2...
2	1.2.826.0.1.3680043.8.498.72921907356394389969...	CVC - Borderline	[[801, 1207], [812, 1112], [823, 1023], [842, ...
3	1.2.826.0.1.3680043.8.498.11697104485452001927...	CVC - Normal	[[1366, 961], [1411, 861], [1453, 751], [1508,...
4	1.2.826.0.1.3680043.8.498.87704688663091069148...	NGT - Normal	[[1862, 14], [1845, 293], [1801, 869], [1716, ...
5	1.2.826.0.1.3680043.8.498.87704688663091069148...	CVC - Normal	[[906, 604], [1103, 578], [1242, 607], [1459, ...
6	1.2.826.0.1.3680043.8.498.87704688663091069148...	ETT - Normal	[[1781, 804], [1801, 666], [1791, 496], [1798,...
7	1.2.826.0.1.3680043.8.498.53113362093090654004...	CVC - Normal	[[1152, 938], [1193, 856], [1265, 795], [1362,...
8	1.2.826.0.1.3680043.8.498.83331936392921199432...	NGT - Normal	[[1903, 73], [1934, 768], [1917, 1061], [1866,...
9	1.2.826.0.1.3680043.8.498.83331936392921199432...	CVC - Normal	[[92, 1857], [163, 1936], [251, 1917], [282, 1...

Visualization of X-rays image

combined train + train_annotations, let’s draw sample image

from PIL import Image, ImageDraw

def train_base_chest_plot(row_ind, base_dir):
    row = annot.loc[row_ind]
    train_img = Image.open(base_dir + row['StudyInstanceUID'] + '.jpg')
    uid = row['StudyInstanceUID']
    label = row['label']
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.imshow(train_img)
    plt.title(f"train: {label}")

base_dir = '../input/ranzcr-clip-catheter-line-classification/train/'
train_base_chest_plot(1, base_dir)

png

But, what we need is to draw tube. Thus, we need to use column ‘data’ in this plot. Let’s do this.

import ast 
import numpy as np

def train_base_tube_plot(row_ind, base_dir):
    row = annot.loc[row_ind]
    train_img = Image.open(base_dir + row['StudyInstanceUID'] + '.jpg')
    uid = row['StudyInstanceUID']
    label = row['label']
    data = np.array(ast.literal_eval(row['data']))
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.imshow(train_img)
    ax.plot(data[:, 0], data[:, 1], color = 'b', linewidth=2, marker='o')
    plt.title(f"train: {label}")

base_dir = '../input/ranzcr-clip-catheter-line-classification/train/'
train_base_tube_plot(1, base_dir)
train_base_tube_plot(2, base_dir)
train_base_tube_plot(25, base_dir)

png

Well, still difficult to figure out what the difference between normal and abnormal is. So, Droped to draw more.