1. The images were provided as pairs of .mhd and .raw files: the .mhd file
holds the header metadata, and the companion .raw file stores the
multidimensional image data. We used the SimpleITK library to read the
.mhd files. Each CT scan has dimensions of 512 x 512 x n, where n is the
number of axial slices; a typical scan contains about 200 slices.
2. There were a total of 551,065 annotations. Of all the annotations
provided, only 1,351 were labeled as nodules; the rest were labeled
negative, so there was a severe class imbalance. A straightforward way to
deal with it was to undersample the majority class and augment the
minority class by rotating the nodule images.
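The balancing step above can be sketched as follows. This is a hypothetical implementation (the exact sampling ratio and augmentation angles are not stated in the original); here each positive patch is augmented with its three 90-degree rotations, and negatives are undersampled to match:

```python
import numpy as np

def balance_classes(patches, labels, seed=0):
    """Undersample negatives and augment positives by 90-degree rotations.

    patches: array of shape (N, H, W); labels: 1 = nodule, 0 = negative.
    """
    rng = np.random.default_rng(seed)
    pos = patches[labels == 1]
    neg = patches[labels == 0]

    # Augment the minority class: keep originals plus 90/180/270-degree rotations.
    pos_aug = np.concatenate(
        [pos] + [np.rot90(pos, k, axes=(1, 2)) for k in (1, 2, 3)]
    )

    # Undersample the majority class down to the augmented positive count.
    keep = rng.choice(len(neg), size=min(len(neg), len(pos_aug)), replace=False)
    neg_sub = neg[keep]

    X = np.concatenate([pos_aug, neg_sub])
    y = np.concatenate([np.ones(len(pos_aug)), np.zeros(len(neg_sub))])
    return X, y
```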
Creating an image database
1. We had a total of 6,881 images in our training set and 1,622 images in
our validation set.
2. Because the amount of data required to train a CNN is very large, it
was desirable to train the model in batches. Loading all the training
data into memory at once is not always possible, since memory must also
hold the model and its intermediate features. So we stored all the
images in an HDF5 dataset using the h5py library.
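A minimal sketch of this pattern with h5py (dataset names and batch size are our own choices): write the arrays once, then read them back in slices so only one batch is in memory at a time.

```python
import h5py
import numpy as np

def build_hdf5(path, images, labels):
    """Write image and label arrays to an HDF5 file."""
    with h5py.File(path, "w") as f:
        f.create_dataset("images", data=images, compression="gzip")
        f.create_dataset("labels", data=labels)

def iter_batches(path, batch_size=64):
    """Yield (images, labels) batches; HDF5 slicing reads only the requested rows."""
    with h5py.File(path, "r") as f:
        n = f["labels"].shape[0]
        for start in range(0, n, batch_size):
            stop = start + batch_size
            yield f["images"][start:stop], f["labels"][start:stop]
```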
Lung Segmentation
Final XGBoost Model