import os
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
Check File Size
Check the size of each dataset folder in this competition
train_tfrecords = 4.5GB
test_tfrecords = 0.5GB
train (image data) = 6.5GB
test (image data) = 0.8GB
import os
def get_folder_size(file_directory):
    # Walk bottom-up so each subfolder's total is known before its parent's.
    dir_sizes = {}
    for r, d, f in os.walk(file_directory, False):
        # Sizes of the entries directly inside r (files and directory entries).
        size = sum(os.path.getsize(os.path.join(r, name)) for name in f + d)
        # Add the accumulated totals of the subfolders.
        size += sum(dir_sizes[os.path.join(r, name)] for name in d)
        dir_sizes[r] = size
        print("{} is {} MB".format(r, round(size / 2**20)))

base_dir = '../input/ranzcr-clip-catheter-line-classification'
get_folder_size(base_dir)
../input/ranzcr-clip-catheter-line-classification/test is 805 MB
../input/ranzcr-clip-catheter-line-classification/test_tfrecords is 555 MB
../input/ranzcr-clip-catheter-line-classification/train_tfrecords is 4563 MB
../input/ranzcr-clip-catheter-line-classification/train is 6592 MB
../input/ranzcr-clip-catheter-line-classification is 12524 MB
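The per-folder totals can also be collected into a dictionary and converted to GB, which makes them easier to compare with the summary figures. This is a minimal sketch using only the standard library. `folder_sizes_bytes` is a hypothetical helper name, and the demo runs on a temporary directory with made-up files rather than on the competition data.

```python
import os
import tempfile

def folder_sizes_bytes(root):
    """Return {directory: total bytes of all files below it} (files only)."""
    sizes = {}
    # Walk bottom-up so child totals exist before their parent is processed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = sum(os.path.getsize(os.path.join(dirpath, name)) for name in filenames)
        # Subfolders that were not walked (e.g. symlinks) count as 0.
        total += sum(sizes.get(os.path.join(dirpath, name), 0) for name in dirnames)
        sizes[dirpath] = total
    return sizes

# Demo on a throwaway directory (hypothetical file names and sizes).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "train"))
    with open(os.path.join(tmp, "train", "a.bin"), "wb") as fh:
        fh.write(b"\0" * 2048)
    with open(os.path.join(tmp, "b.bin"), "wb") as fh:
        fh.write(b"\0" * 1024)
    demo_sizes = folder_sizes_bytes(tmp)
    demo_total = demo_sizes[tmp]
    demo_train = demo_sizes[os.path.join(tmp, "train")]
    for path, size in demo_sizes.items():
        print("{} is {:.6f} GB".format(os.path.relpath(path, tmp), size / 2**30))
```

Unlike `get_folder_size`, this sketch counts file bytes only and skips the size of the directory entries themselves, so its totals may come out slightly smaller.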
NGT - Incompletely Imaged - nasogastric tube placement inconclusive due to imaging
NGT - Normal - nasogastric tube placement normal
CVC - Abnormal - central venous catheter placement abnormal
CVC - Borderline - central venous catheter placement borderline abnormal
CVC - Normal - central venous catheter placement normal
Swan Ganz Catheter Present(??)
PatientID - unique ID for each patient in the dataset
Data Distribution of Each Variable
Why do the two totals (row count and label count) differ?
When catheters and lines are inserted, some patients need them placed in multiple positions, so a single row can carry more than one positive label.
For example, see PatientID bf4c6da3c.
Note, however, that the three groups (ETT, NGT, CVC) are counted separately.
print("Total Rows of Train Data is", len(train))
print("Total Count of Each Variable in Train Data is", train.iloc[:, :-1].sum().sum())
var_cal_tmp = train.iloc[:, :-1].sum()
print(var_cal_tmp)
Total Rows of Train Data is 30083
Total Count of Each Variable in Train Data is 50619
ETT - Abnormal 79
ETT - Borderline 1138
ETT - Normal 7240
NGT - Abnormal 279
NGT - Borderline 529
NGT - Incompletely Imaged 2748
NGT - Normal 4797
CVC - Abnormal 3195
CVC - Borderline 8460
CVC - Normal 21324
Swan Ganz Catheter Present 830
dtype: int64
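One way to see why the row count and the label count differ is to count how many positive labels each row carries. The sketch below uses a small made-up frame with three of the label columns. The values and patient IDs are hypothetical and are not taken from `train`. Applied to the real data, the same `sum(axis=1)` call should reveal rows with more than one positive label.

```python
import pandas as pd

# Toy stand-in for `train`: hypothetical rows, three label columns, PatientID last.
toy = pd.DataFrame(
    {
        "ETT - Normal": [1, 0, 1, 0],
        "NGT - Normal": [1, 0, 0, 0],
        "CVC - Normal": [1, 1, 1, 0],
        "PatientID": ["p1", "p1", "p2", "p3"],
    }
)

labels = toy.iloc[:, :-1]            # every column except PatientID
labels_per_row = labels.sum(axis=1)  # number of positive labels on each image

n_rows = len(toy)
n_labels = int(labels.sum().sum())
print("rows:", n_rows, "labels:", n_labels)
print(labels_per_row.value_counts().sort_index())

# Labels aggregated per patient: one patient can contribute several rows.
per_patient = toy.groupby("PatientID").sum()
print(per_patient)
```

In the toy frame, four rows carry six labels in total, the same kind of gap as between 30083 rows and 50619 labels.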
from IPython.display import YouTubeVideo
YouTubeVideo('FtJr7i7ENMY')
Nasogastric Tube
It is called NGT in this dataset.
YouTubeVideo('Abf3Gd6AaZQ')
Central venous catheter
It is called CVC in this dataset.
YouTubeVideo('mTBrCMn86cU')
Swan Ganz Catheter Present
It appears as the Swan Ganz Catheter Present column in this dataset.
YouTubeVideo('YkN30T6ig30')
Check train annotation file
What’s inside the train_annotations file?
Its stated purpose is: ‘These are segmentation annotations for training samples that have them. They are included solely as additional information for competitors.’
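To get a feel for what such segmentation annotations might look like, here is a rough sketch of parsing them. It assumes the file has the columns `StudyInstanceUID`, `label` and `data`, with `data` holding a string-encoded list of `[x, y]` points. Both the column names and the point format are assumptions, and the two sample rows are invented, so check them against the actual CSV before relying on this.

```python
import ast
import csv
import io

# Hypothetical two-row sample; column names and point format are assumptions.
sample_csv = io.StringIO(
    'StudyInstanceUID,label,data\n'
    'uid_a,CVC - Normal,"[[10, 20], [15, 25], [20, 40]]"\n'
    'uid_b,NGT - Normal,"[[5, 5], [6, 9]]"\n'
)

annotations = []
for row in csv.DictReader(sample_csv):
    points = ast.literal_eval(row["data"])  # string -> list of [x, y] pairs
    annotations.append((row["StudyInstanceUID"], row["label"], points))

for uid, label, points in annotations:
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Bounding box of the annotated line as (x_min, y_min, x_max, y_max).
    print(uid, label, len(points), "points, bbox:", (min(xs), min(ys), max(xs), max(ys)))
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals, which is safer for text read from a file.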