Group: 7
Presented By:
Sukanya Samanta
Suraj Pandey
Under the guidance of:
Prof. Gitosree Khan
OBJECTIVE
Analysing gene alterations to stratify individuals who are predisposed to
a higher risk of cancer, and analysing somatic mutations to profile tumor
characteristics for precise therapy selection.
Converting unstructured data to structured
Integrity check
Removing extra data
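The three preprocessing steps named above can be sketched roughly as follows. This is only an illustration using pandas: the record layout, the column names (`sample_id`, `protein_a`, `protein_b`, `notes`) and the values are hypothetical and are not taken from the project data.

```python
import pandas as pd

# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"sample_id": "S1", "protein_a": "0.52", "protein_b": "1.10", "notes": "x"},
    {"sample_id": "S2", "protein_a": "0.75", "protein_b": "-0.40", "notes": "y"},
    {"sample_id": "S2", "protein_a": "0.75", "protein_b": "-0.40", "notes": "y"},  # exact duplicate
    {"sample_id": "S3", "protein_a": "n/a", "protein_b": "-1.30", "notes": "z"},
]


def preprocess(records):
    # 1. Unstructured -> structured: load the records into a table and
    #    convert the measurement columns to numbers (bad values become NaN).
    df = pd.DataFrame(records)
    value_cols = [c for c in df.columns if c.startswith("protein_")]
    df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")

    # 2. Integrity check: drop exact duplicate rows, then require that every
    #    sample ID is present and unique.
    df = df.drop_duplicates().reset_index(drop=True)
    if df["sample_id"].isna().any() or df["sample_id"].duplicated().any():
        raise ValueError("sample IDs must be present and unique")

    # 3. Remove extra data: keep only the ID and the measurement columns.
    return df[["sample_id"] + value_cols]


clean = preprocess(raw_records)
print(clean)
```

Missing values are left as NaN here. Whether to impute them or drop them depends on the dataset and is not decided by this sketch.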
The first column, "Complete TCGA ID", is used to match the sample IDs in the main cancer proteomes file.
All other columns have self-explanatory names and contain data about the cancer classification of a given sample
obtained using different methods.
File: PAM50_proteins.csv
Contains the list of genes and proteins used by the PAM50 classification system. The column RefSeqProteinID contains the
protein IDs that can be matched with the IDs in the main protein expression data set.
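A minimal sketch of that matching step, assuming both tables are loaded with pandas. Only the column name RefSeqProteinID comes from the description above. The `protein_id` column, the sample columns, the gene symbols and the ID values are placeholders made up for the example.

```python
import pandas as pd

# Toy stand-in for PAM50_proteins.csv; RefSeqProteinID is the column named
# in the file description. Gene symbols and IDs are made-up placeholders.
pam50 = pd.DataFrame({
    "GeneSymbol": ["GENE_A", "GENE_B"],
    "RefSeqProteinID": ["NP_000001", "NP_000002"],
})

# Toy stand-in for the main protein expression data. The ID column name
# "protein_id" and the sample columns are assumptions.
expression = pd.DataFrame({
    "protein_id": ["NP_000001", "NP_000002", "NP_000003"],
    "sample_1": [0.8, -1.2, 0.1],
    "sample_2": [1.5, 0.3, -0.7],
})

# Keep only the expression rows whose protein ID appears in the PAM50 list.
pam50_expression = expression[expression["protein_id"].isin(pam50["RefSeqProteinID"])]

# Alternatively, merge so the gene annotation travels with the expression values.
annotated = pam50.merge(
    expression, left_on="RefSeqProteinID", right_on="protein_id", how="inner"
)
print(annotated)
```

With the real files, the two DataFrames would come from `pd.read_csv` instead of being built inline.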
DISCUSSION
For gini
Cluster Visualization
FUTURE SCOPE / LIMITATIONS
1. The supervised learning models can be iterated on further to achieve higher accuracy.
3. Overall, the small number of samples made it difficult to build models with higher accuracy.
As more data becomes available, these models can be retrained and re-evaluated.
4. No single model is applicable to every dataset. To choose a model, we always need to analyze the
dataset first and then apply a suitable machine learning model.
Jupyter Notebook
CONCLUSION
We built several classification models and found that the Random Forest, SVM and K Nearest Neighbour
algorithms give the best results for our dataset. The classification algorithms were applied to the
available attributes (primarily mass spectrometry analysis results for 12553 proteins) to see whether
a model can predict the correct cancer type label.
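A comparison of this kind could be set up along the following lines with scikit-learn. This is a sketch under stated assumptions, not the notebook's actual code: the data is synthetic (generated with `make_classification`), and the sample count, feature count, number of classes, hyperparameters and 5-fold cross-validation are all chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the protein expression matrix (assumed sizes):
# few samples, many features, several cancer-type classes.
X, y = make_classification(
    n_samples=80,
    n_features=50,
    n_informative=10,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# The three algorithms named in the conclusion; hyperparameters are assumptions.
models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=100, criterion="gini", random_state=0
    ),
    # SVM and KNN are distance-based, so features are standardized first.
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "K Nearest Neighbour": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
}

# Stratified folds keep the class proportions similar in every split,
# which matters when the number of samples is small.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.3f}")
```

Cross-validated accuracy is used instead of a single train/test split because, with few samples, a single split can give a misleading estimate.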
THANK YOU!