RANZCR EDA Tutorial

Competition

Intro

import os

import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

Check File Size

  • Check the size of each dataset folder in this competition
    • train_tfrecords = 4.5GB
    • test_tfrecords = 0.5GB
    • train (image data) = 6.5GB
    • test (image data) = 0.8GB
import os

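def get_folder_size(file_directory):
  # Walk bottom-up (topdown=False) so every subfolder total is known before its parent
  dir_sizes = {}
  for root, dirs, files in os.walk(file_directory, topdown=False):
      # Size of the entries directly inside this folder (files plus directory entries)
      size = sum(os.path.getsize(os.path.join(root, name)) for name in files + dirs)
      # Add the totals already computed for each subfolder
      size += sum(dir_sizes[os.path.join(root, d)] for d in dirs)
      dir_sizes[root] = size
      print("{} is {} MB".format(root, round(size / 2**20)))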
  
base_dir = '../input/ranzcr-clip-catheter-line-classification'
get_folder_size(base_dir)
../input/ranzcr-clip-catheter-line-classification/test is 805 MB
../input/ranzcr-clip-catheter-line-classification/test_tfrecords is 555 MB
../input/ranzcr-clip-catheter-line-classification/train_tfrecords is 4563 MB
../input/ranzcr-clip-catheter-line-classification/train is 6592 MB
../input/ranzcr-clip-catheter-line-classification is 12524 MB
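The same totals can also be returned as values instead of printed. The following is only a minimal sketch: `folder_sizes_mb` is a hypothetical helper name, not part of the original notebook, and it counts regular files only, so its numbers may differ slightly from the function above, which also adds the size of directory entries.

```python
from pathlib import Path


def folder_sizes_mb(base_dir):
    """Return {subfolder name: total size in MB} for each direct subfolder.

    Hypothetical helper for illustration; only regular files are counted.
    """
    sizes = {}
    for sub in sorted(Path(base_dir).iterdir()):
        if sub.is_dir():
            # rglob("*") visits every nested entry; keep regular files only
            total_bytes = sum(p.stat().st_size for p in sub.rglob("*") if p.is_file())
            sizes[sub.name] = total_bytes / 2**20
    return sizes
```

Calling `folder_sizes_mb(base_dir)` on the competition folder would give one entry per subfolder (train, test, train_tfrecords, test_tfrecords), which is easier to sort or plot than printed lines.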

Check train file

  • Let’s look at the train data
train = pd.read_csv('../input/ranzcr-clip-catheter-line-classification/train.csv', index_col = 0)
test = pd.read_csv('../input/ranzcr-clip-catheter-line-classification/sample_submission.csv', index_col = 0)
display(train.head())
display(test.head())
ETT - Abnormal ETT - Borderline ETT - Normal NGT - Abnormal NGT - Borderline NGT - Incompletely Imaged NGT - Normal CVC - Abnormal CVC - Borderline CVC - Normal Swan Ganz Catheter Present PatientID
StudyInstanceUID
1.2.826.0.1.3680043.8.498.26697628953273228189375557799582420561 0 0 0 0 0 0 1 0 0 0 0 ec89415d1
1.2.826.0.1.3680043.8.498.46302891597398758759818628675365157729 0 0 1 0 0 1 0 0 0 1 0 bf4c6da3c
1.2.826.0.1.3680043.8.498.23819260719748494858948050424870692577 0 0 0 0 0 0 0 0 1 0 0 3fc1c97e5
1.2.826.0.1.3680043.8.498.68286643202323212801283518367144358744 0 0 0 0 0 0 0 1 0 0 0 c31019814
1.2.826.0.1.3680043.8.498.10050203009225938259119000528814762175 0 0 0 0 0 0 0 0 0 1 0 207685cd1
ETT - Abnormal ETT - Borderline ETT - Normal NGT - Abnormal NGT - Borderline NGT - Incompletely Imaged NGT - Normal CVC - Abnormal CVC - Borderline CVC - Normal Swan Ganz Catheter Present
StudyInstanceUID
1.2.826.0.1.3680043.8.498.46923145579096002617106567297135160932 0 0 0 0 0 0 0 0 0 0 0
1.2.826.0.1.3680043.8.498.84006870182611080091824109767561564887 0 0 0 0 0 0 0 0 0 0 0
1.2.826.0.1.3680043.8.498.12219033294413119947515494720687541672 0 0 0 0 0 0 0 0 0 0 0
1.2.826.0.1.3680043.8.498.84994474380235968109906845540706092671 0 0 0 0 0 0 0 0 0 0 0
1.2.826.0.1.3680043.8.498.35798987793805669662572108881745201372 0 0 0 0 0 0 0 0 0 0 0

Definitions of Variables

  • What’s inside data?
    • StudyInstanceUID - unique ID for each image
    • ETT - Abnormal - endotracheal tube placement abnormal
    • ETT - Borderline - endotracheal tube placement borderline abnormal
    • ETT - Normal - endotracheal tube placement normal
    • NGT - Abnormal - nasogastric tube placement abnormal
    • NGT - Borderline - nasogastric tube placement borderline abnormal
    • NGT - Incompletely Imaged - nasogastric tube placement inconclusive due to imaging
    • NGT - Normal - nasogastric tube placement normal
    • CVC - Abnormal - central venous catheter placement abnormal
    • CVC - Borderline - central venous catheter placement borderline abnormal
    • CVC - Normal - central venous catheter placement normal
    • Swan Ganz Catheter Present - a Swan-Ganz (pulmonary artery) catheter is present in the image
    • PatientID - unique ID for each patient in the dataset
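The label names follow a "device - status" pattern, which can be split programmatically. This is a small sketch for illustration; `split_label` is a hypothetical helper, and the column list is copied from the table header above.

```python
def split_label(column_name):
    """Split 'ETT - Abnormal' into ('ETT', 'Abnormal').

    Columns without the ' - ' separator (here 'Swan Ganz Catheter Present')
    are returned as (column_name, None).
    """
    device, sep, status = column_name.partition(" - ")
    return (device, status) if sep else (column_name, None)


label_columns = [
    "ETT - Abnormal", "ETT - Borderline", "ETT - Normal",
    "NGT - Abnormal", "NGT - Borderline", "NGT - Incompletely Imaged", "NGT - Normal",
    "CVC - Abnormal", "CVC - Borderline", "CVC - Normal",
    "Swan Ganz Catheter Present",
]

# Collect the statuses available for each device group
groups = {}
for col in label_columns:
    device, status = split_label(col)
    groups.setdefault(device, []).append(status)

for device, statuses in groups.items():
    print(device, statuses)
```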

Data Distribution of Each Variable

  • Why are the two totals below different (30083 rows vs. 50619 labels)?
    • When catheters and lines are inserted, some patients need them in multiple positions, so a single image can carry more than one label.
    • Let’s look at PatientID bf4c6da3c (the second row of train).
  • In that row, the three groups ETT, NGT, and CVC are each counted separately.
print("Total Rows of Train Data is", len(train))
print("Total Count of Each Variable in Train Data is", train.iloc[:, :-1].sum().sum())

var_cal_tmp = train.iloc[:, :-1].sum()
print(var_cal_tmp)
Total Rows of Train Data is 30083
Total Count of Each Variable in Train Data is 50619
ETT - Abnormal                   79
ETT - Borderline               1138
ETT - Normal                   7240
NGT - Abnormal                  279
NGT - Borderline                529
NGT - Incompletely Imaged      2748
NGT - Normal                   4797
CVC - Abnormal                 3195
CVC - Borderline               8460
CVC - Normal                  21324
Swan Ganz Catheter Present      830
dtype: int64
train.iloc[1].to_frame().T
ETT - Abnormal ETT - Borderline ETT - Normal NGT - Abnormal NGT - Borderline NGT - Incompletely Imaged NGT - Normal CVC - Abnormal CVC - Borderline CVC - Normal Swan Ganz Catheter Present PatientID
1.2.826.0.1.3680043.8.498.46302891597398758759818628675365157729 0 0 1 0 0 1 0 0 0 1 0 bf4c6da3c
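Counting labels per image makes the gap between the two totals concrete. The sketch below rebuilds only the five rows shown by train.head() earlier, so it is an illustration rather than the full computation; on the real data the same idea would be `train.iloc[:, :-1].sum(axis=1)`.

```python
import pandas as pd

label_columns = [
    "ETT - Abnormal", "ETT - Borderline", "ETT - Normal",
    "NGT - Abnormal", "NGT - Borderline", "NGT - Incompletely Imaged", "NGT - Normal",
    "CVC - Abnormal", "CVC - Borderline", "CVC - Normal",
    "Swan Ganz Catheter Present",
]

# Label values of the five rows displayed by train.head()
head_rows = [
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0],  # PatientID bf4c6da3c: ETT, NGT and CVC all set
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
]
sample = pd.DataFrame(head_rows, columns=label_columns)

# Row-wise sum = number of positive labels on each image
labels_per_image = sample.sum(axis=1)
print(labels_per_image.value_counts().sort_index())
print("rows:", len(sample), "labels:", int(sample.to_numpy().sum()))
```

Even in this small sample there are more labels than rows, because one image carries three labels.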

Quick Visualization

  • Overall, CVC labels outnumber those of the other groups.
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x = var_cal_tmp.values, y = var_cal_tmp.index, ax=ax)
ax.tick_params(axis="x", labelsize=14)
ax.tick_params(axis="y", labelsize=14)
ax.set_xlabel("Number of Images", fontsize=15)
ax.set_title("Distribution of Labels", fontsize=15)
[Figure: horizontal bar chart "Distribution of Labels"; x-axis "Number of Images", one bar per label]
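To check the claim about CVC numerically, the per-label counts can be summed by device group. This is a rough sketch that copies the counts from the printed output above; grouping by the text before " - " is an assumption about the naming pattern.

```python
# Positive counts per label, copied from the printed output above
label_counts = {
    "ETT - Abnormal": 79, "ETT - Borderline": 1138, "ETT - Normal": 7240,
    "NGT - Abnormal": 279, "NGT - Borderline": 529,
    "NGT - Incompletely Imaged": 2748, "NGT - Normal": 4797,
    "CVC - Abnormal": 3195, "CVC - Borderline": 8460, "CVC - Normal": 21324,
    "Swan Ganz Catheter Present": 830,
}

group_totals = {}
for label, count in label_counts.items():
    # 'Swan Ganz Catheter Present' has no ' - ' separator, so it forms its own group
    group = label.split(" - ")[0]
    group_totals[group] = group_totals.get(group, 0) + count

for group, total in sorted(group_totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group}: {total}")
```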

  • The number of unique patients is much smaller than the number of images.
  • This means some patients were imaged many times, and the number of images varies widely from patient to patient.
print("Number of Unique Patients: ", train["PatientID"].unique().shape[0])
print("Number of Total Data: ", len(train["PatientID"]))
Number of Unique Patients:  3255
Number of Total Data:  30083
tmp = train['PatientID'].value_counts()
print(tmp)
fig, ax = plt.subplots(figsize=(24, 6))
# x-axis: images per patient; bar height: number of patients with that many images
sns.countplot(x = tmp.values, ax=ax)
ax.tick_params(axis="x", labelsize=10)
ax.tick_params(axis="y", labelsize=14)
ax.set_xlabel("Number of Images per Patient", fontsize=15)
ax.set_title("Distribution of Images per Patient", fontsize=15)
05029c63a    172
55073fece    167
26da0d5ad    148
8849382d0    130
34242119f    110
            ... 
ad32e88e0      1
7755053cb      1
2d5a5f0d0      1
1951dc11c      1
22e8f333f      1
Name: PatientID, Length: 3255, dtype: int64

[Figure: count plot "Distribution of Images per Patient"; x-axis "Number of Images per Patient", bar height = number of patients]
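A few summary numbers can complement the plot. The sketch below uses a hypothetical toy PatientID column, not the competition data; in the notebook the series would be `train["PatientID"]`.

```python
import pandas as pd

# Hypothetical PatientID column standing in for train["PatientID"]
patient_ids = pd.Series(
    ["p1", "p1", "p1", "p1", "p2", "p2", "p3", "p4"], name="PatientID"
)

# Number of images recorded for each patient
images_per_patient = patient_ids.value_counts()

summary = {
    "patients": int(images_per_patient.size),
    "images": int(images_per_patient.sum()),
    "median_images_per_patient": float(images_per_patient.median()),
    "max_images_per_patient": int(images_per_patient.max()),
    "share_of_patients_with_one_image": float((images_per_patient == 1).mean()),
}
for key, value in summary.items():
    print(f"{key}: {value}")
```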

  • Now let’s look at the distribution of values within each variable.
target_cols = ['ETT - Abnormal', 'ETT - Borderline', 'ETT - Normal', 'NGT - Abnormal', 
               'NGT - Borderline', 'NGT - Incompletely Imaged', 'NGT - Normal', 'CVC - Abnormal',
               'CVC - Borderline', 'CVC - Normal', 'Swan Ganz Catheter Present']

fig, ax = plt.subplots(4, 3, figsize=(16, 10))
for i, col in enumerate(target_cols):
  print(i, col)
  # Place the i-th label at row i // 3, column i % 3 of the 4x3 grid
  ax[i // 3, i % 3].hist(train[col].values)
  ax[i // 3, i % 3].set_title(f'target: {col}')

fig.tight_layout()
fig.subplots_adjust(top=0.95)
0 ETT - Abnormal
1 ETT - Borderline
2 ETT - Normal
3 NGT - Abnormal
4 NGT - Borderline
5 NGT - Incompletely Imaged
6 NGT - Normal
7 CVC - Abnormal
8 CVC - Borderline
9 CVC - Normal
10 Swan Ganz Catheter Present

[Figure: 4x3 grid of histograms, one panel per target label, each titled "target: <label>"]

  • How to interpret the graph?
    • The CVC group has the most positive labels of all groups.
    • Within each group, Normal is the most common label.
  • The dataset is clearly imbalanced, and since one image can carry several labels at once, this is a multi-label classification problem.
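The imbalance can be put into numbers by dividing each label count by the number of training rows. This is a minimal sketch using the counts and the row total (30083) printed earlier.

```python
TOTAL_IMAGES = 30083  # "Total Rows of Train Data" printed earlier

# Positive counts per label, copied from the printed output
label_counts = {
    "ETT - Abnormal": 79, "ETT - Borderline": 1138, "ETT - Normal": 7240,
    "NGT - Abnormal": 279, "NGT - Borderline": 529,
    "NGT - Incompletely Imaged": 2748, "NGT - Normal": 4797,
    "CVC - Abnormal": 3195, "CVC - Borderline": 8460, "CVC - Normal": 21324,
    "Swan Ganz Catheter Present": 830,
}

# Share of images that carry each label
positive_rate = {label: count / TOTAL_IMAGES for label, count in label_counts.items()}
for label, rate in positive_rate.items():
    print(f"{label:<28} {rate:6.2%}")

rarest = min(label_counts, key=label_counts.get)
commonest = max(label_counts, key=label_counts.get)
imbalance_ratio = label_counts[commonest] / label_counts[rarest]
print(f"{commonest} is about {imbalance_ratio:.0f}x as frequent as {rarest}")
```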

Background Knowledge

  • Since my background is far from the medical field, it is difficult to figure out what to classify in the images.
  • So, some videos are needed to understand the procedures.
  • Thanks to RANZCR CLiP: Visualize and Understand Dataset
    • Please visit it and upvote

Endotracheal Tube

  • It is called ETT in this dataset.
from IPython.display import YouTubeVideo
YouTubeVideo('FtJr7i7ENMY')

Nasogastric Tube

  • It is called NGT in this dataset.
YouTubeVideo('Abf3Gd6AaZQ')

Central venous catheter

  • It is called CVC in this dataset.
YouTubeVideo('mTBrCMn86cU')

Swan Ganz Catheter Present

  • It is labeled Swan Ganz Catheter Present in this dataset.
YouTubeVideo('YkN30T6ig30')

Check train annotation file

  • What’s inside the train_annotations file?
    • The competition describes its purpose as: ‘These are segmentation annotations for training samples that have them. They are included solely as additional information for competitors.’
  • Let’s look at the data
annot = pd.read_csv("../input/ranzcr-clip-catheter-line-classification/train_annotations.csv")
annot.head(10)
StudyInstanceUID label data
0 1.2.826.0.1.3680043.8.498.12616281126973421762... CVC - Normal [[1487, 1279], [1477, 1168], [1472, 1052], [14...
1 1.2.826.0.1.3680043.8.498.12616281126973421762... CVC - Normal [[1328, 7], [1347, 101], [1383, 193], [1400, 2...
2 1.2.826.0.1.3680043.8.498.72921907356394389969... CVC - Borderline [[801, 1207], [812, 1112], [823, 1023], [842, ...
3 1.2.826.0.1.3680043.8.498.11697104485452001927... CVC - Normal [[1366, 961], [1411, 861], [1453, 751], [1508,...
4 1.2.826.0.1.3680043.8.498.87704688663091069148... NGT - Normal [[1862, 14], [1845, 293], [1801, 869], [1716, ...
5 1.2.826.0.1.3680043.8.498.87704688663091069148... CVC - Normal [[906, 604], [1103, 578], [1242, 607], [1459, ...
6 1.2.826.0.1.3680043.8.498.87704688663091069148... ETT - Normal [[1781, 804], [1801, 666], [1791, 496], [1798,...
7 1.2.826.0.1.3680043.8.498.53113362093090654004... CVC - Normal [[1152, 938], [1193, 856], [1265, 795], [1362,...
8 1.2.826.0.1.3680043.8.498.83331936392921199432... NGT - Normal [[1903, 73], [1934, 768], [1917, 1061], [1866,...
9 1.2.826.0.1.3680043.8.498.83331936392921199432... CVC - Normal [[92, 1857], [163, 1936], [251, 1917], [282, 1...
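The data column is stored as a string, so it has to be parsed before the points can be used. The sketch below parses only the first three points of row 0 that are visible above (the notebook output truncates the rest) and treats each pair as [x, y], which is how the plotting code later in this post uses them.

```python
import ast

# First three points of row 0 as shown above; the remaining points are truncated there
data_str = "[[1487, 1279], [1477, 1168], [1472, 1052]]"

# literal_eval safely turns the string into a list of [x, y] pairs
points = ast.literal_eval(data_str)
xs = [x for x, _ in points]
ys = [y for _, y in points]

print("number of points:", len(points))
print("x range:", min(xs), max(xs))
print("y range:", min(ys), max(ys))
```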

Visualization of X-rays image

  • Combining the train images with train_annotations, let’s draw a sample image
from PIL import Image, ImageDraw

def train_base_chest_plot(row_ind, base_dir):
    # Look up one annotation row and show the X-ray it refers to
    row = annot.loc[row_ind]
    train_img = Image.open(base_dir + row['StudyInstanceUID'] + '.jpg')
    label = row['label']
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.imshow(train_img)
    ax.set_title(f"train: {label}")

base_dir = '../input/ranzcr-clip-catheter-line-classification/train/'
train_base_chest_plot(1, base_dir)

[Figure: chest X-ray for annotation row 1, titled with its label]

  • But what we really need is to draw the tube itself, so we use the ‘data’ column, which holds the annotated points. Let’s do this.
import ast 
import numpy as np

def train_base_tube_plot(row_ind, base_dir):
    # Show the X-ray for one annotation row and overlay the annotated tube path
    row = annot.loc[row_ind]
    train_img = Image.open(base_dir + row['StudyInstanceUID'] + '.jpg')
    label = row['label']
    # 'data' is a string such as "[[x1, y1], [x2, y2], ...]"; parse it into an (N, 2) array
    data = np.array(ast.literal_eval(row['data']))
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.imshow(train_img)
    ax.plot(data[:, 0], data[:, 1], color = 'b', linewidth=2, marker='o')
    ax.set_title(f"train: {label}")

base_dir = '../input/ranzcr-clip-catheter-line-classification/train/'
train_base_tube_plot(1, base_dir)
train_base_tube_plot(2, base_dir)
train_base_tube_plot(25, base_dir)

[Figures: three chest X-rays with the annotated tube path overlaid in blue, for annotation rows 1, 2, and 25]

  • Well, it is still difficult to tell the difference between normal and abnormal placements, so I stopped drawing more.
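One possible next step, sketched here only as an idea, is to group the annotations by image and give each device its own color, since a single image can have several annotated tubes (rows 4 to 6 of the annotation table appear to share one StudyInstanceUID). The names `DEVICE_COLORS` and `color_for`, the color choices, and the placeholder UIDs are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical color choice per device group; any distinct colors would do
DEVICE_COLORS = {"ETT": "tab:red", "NGT": "tab:green", "CVC": "tab:blue"}


def color_for(label):
    """Pick a plot color from the device prefix of a label such as 'CVC - Normal'."""
    device = label.split(" - ")[0]
    return DEVICE_COLORS.get(device, "tab:orange")  # fallback for any other label


# (uid, label) pairs standing in for rows of annot; the UIDs are placeholders
rows = [
    ("img_a", "NGT - Normal"),
    ("img_a", "CVC - Normal"),
    ("img_a", "ETT - Normal"),
    ("img_b", "CVC - Borderline"),
]

# Collect every annotated label that belongs to the same image
labels_by_image = defaultdict(list)
for uid, label in rows:
    labels_by_image[uid].append(label)

for uid, labels in labels_by_image.items():
    print(uid, [(label, color_for(label)) for label in labels])
```

Inside train_base_tube_plot, passing `color=color_for(label)` to `ax.plot` would then draw ETT, NGT, and CVC paths in different colors on the same X-ray.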