
Mathematical Model of Classification of Human Genomic Data for Breast Cancer

Group: 7
Presented by: Sukanya Samanta, Suraj Pandey
Under the guidance of: Prof. Gitosree Khan
OBJECTIVE
To analyse gene alterations in order to stratify individuals who are predisposed to
a higher risk of cancer, and to analyse somatic mutations in order to profile tumor
characteristics for precise therapy selection.

To inform, educate and help cancer treatment and research.


INTRODUCTION
● Breast Cancer is a multifactorial disease that forms in the cells of the breast. Breast cancer can occur in both men and
women, but it's far more common in women.
● Many factors increase one's chances of developing breast cancer; mutations in DNA are often the root
cause. Other factors that may influence one's susceptibility to developing breast cancer are
mutations to genes that assist in cell differentiation (proto-oncogenes) and mutations to genes that modulate cell
division, cellular repair and apoptosis (tumor suppressor genes).
● Breast cancer awareness and research funding have helped create advances in the diagnosis and treatment of breast
cancer. Medical advances and innovation in treating breast cancer have increased survival rates, lowered recurrence rates,
and lowered the number of deaths associated with the disease.
● More recently, the introduction of precision medicine and gene therapy has shown the potential to transform how physicians
treat breast cancer. Precision medicine refers to the tailoring of medical treatment based on the cellular profile of a
disease and the patient's genome.
PROPOSED WORK: Work Flow
1. Collecting datasets

2. Converting unstructured data to structured data

3. Integrity check

4. Removing extra data

5. Selecting the best model by applying machine learning algorithms

6. Getting the best result for our dataset
EXTENDED WORK: DATA SETS
File: 77_cancer_proteomes_CPTAC_itraq.csv:

Includes 12553 unique genes from a total of 89 breast cancer patients


File: clinical_data_breast_cancer.csv:

Contains clinical information from 105 breast cancer patients.

The first column, "Complete TCGA ID", is used to match the sample IDs in the main cancer proteomes file.

All other columns have self-explanatory names and contain data about the cancer classification of a given sample using
different methods.
File: PAM50_proteins.csv

Contains the list of genes and proteins used by the PAM50 classification system. The column RefSeqProteinID contains the
protein IDs that can be matched with the IDs in the main protein expression data set.
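
The ID matching described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the two ID formats shown in the docstring are assumptions about how the proteome column names and the "Complete TCGA ID" values are written, and the helper names `to_clinical_id` and `match_samples` are hypothetical.

```python
def to_clinical_id(sample_column: str) -> str:
    """Convert a proteome column name to the clinical-file ID format.

    Assumed formats (not confirmed by the slides):
      proteome column name:  "AA-0001.01TCGA"
      Complete TCGA ID:      "TCGA-AA-0001"
    """
    core = sample_column.split(".")[0]  # drop the ".01TCGA"-style suffix
    return "TCGA-" + core


def match_samples(proteome_columns, clinical_ids):
    """Return {clinical_id: proteome_column} for samples present in both files."""
    clinical = set(clinical_ids)
    matched = {}
    for column in proteome_columns:
        clinical_id = to_clinical_id(column)
        if clinical_id in clinical:
            matched[clinical_id] = column
    return matched


# Placeholder IDs for illustration only; they are not taken from the data files.
columns = ["AA-0001.01TCGA", "BB-0002.01TCGA", "XX-0000.01TCGA"]
clinical_ids = ["TCGA-AA-0001", "TCGA-BB-0002", "TCGA-ZZ-9999"]
matched = match_samples(columns, clinical_ids)
```

Only samples that appear in both files are kept, which matters here because the two files cover different numbers of patients.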
DISCUSSION

Proportion of Different Tumor Types Using a Pie Chart
BAR PLOT of OS Time
LINE CHART
Data Transformation (PCA Implementation)
Dataframe after converting into principal components
PCA Transformation
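
A minimal sketch of the PCA step, assuming NumPy is available. The slides do not show which implementation was used, so the function `pca_transform` and the toy matrix below are illustrative only. The sketch centres each feature, takes the singular value decomposition, and projects the samples onto the leading principal directions.

```python
import numpy as np


def pca_transform(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centred features
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T  # sample coordinates in the new basis
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:n_components]


# Toy matrix: 6 samples x 3 features (made-up numbers, not the study data).
X_toy = np.array([
    [2.0, 0.1, 1.0],
    [4.0, 0.0, 2.1],
    [6.0, 0.2, 2.9],
    [8.0, 0.1, 4.0],
    [10.0, 0.0, 5.1],
    [12.0, 0.2, 5.9],
])
scores, explained_ratio = pca_transform(X_toy, n_components=2)
```

With thousands of protein columns and only a few dozen samples, reducing to a handful of components before classification keeps the number of features below the number of samples.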
SUPERVISED LEARNING: CLASSIFICATION
WITH SUPPORT VECTOR MACHINES

SVM Working Principle


Results of classification carried out using support vector machines
are summarized in a confusion matrix.

Accuracy for test data is equal to 77.78%.
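
The link between a confusion matrix and an accuracy figure can be sketched as below. The matrix is a made-up 3-class example, not the matrix from this slide, and `accuracy_from_confusion` is a hypothetical helper name.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Rows are true classes, columns are predicted classes (illustrative values).
toy_matrix = [
    [6, 1, 1],
    [1, 5, 1],
    [0, 1, 4],
]
toy_accuracy = accuracy_from_confusion(toy_matrix)  # 15 correct out of 20
```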
SUPERVISED LEARNING: CLASSIFICATION WITH
MULTINOMIAL LOGISTIC REGRESSION

The accuracy obtained is 66.67%


SUPERVISED LEARNING: CLASSIFICATION WITH K
NEAREST NEIGHBORS

With the KNN classifier, the accuracy obtained is 77.78%.
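
For reference, the K nearest neighbours rule itself is short enough to sketch in plain Python. This is an illustration of the technique, assuming Euclidean distance and a simple majority vote; it is not the classifier used to obtain the figure above, and the points and labels are made up.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Two well-separated toy groups with placeholder class names.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
prediction_near_a = knn_predict(points, labels, (0.5, 0.5))
prediction_near_b = knn_predict(points, labels, (5.5, 5.5))
```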
SUPERVISED LEARNING: CLASSIFICATION WITH
DECISION TREES

For gini:
Accuracy score for Decision tree classifier (gini) with test data set is: 48.14 %

For entropy:
Accuracy score for Decision tree classifier (entropy) with test data set is: 59.25 %
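
The two split criteria compared above measure node impurity in different ways. A minimal sketch of both measures, computed from the class counts in a node (the function names and the counts are illustrative):

```python
import math


def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits: -sum of p * log2(p) over non-empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# A node with an even two-class split is maximally impure under both measures.
even_gini = gini([5, 5])
even_entropy = entropy([5, 5])
# A node that contains a single class is pure under both measures.
pure_gini = gini([10, 0])
pure_entropy = entropy([10, 0])
```

Both measures are zero for a pure node and largest for an even split, so they often choose similar splits; on a small dataset, a few different splits are enough to change the test accuracy.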
SUPERVISED LEARNING: CLASSIFICATION WITH
GRADIENT BOOSTING (ENSEMBLE TECHNIQUES)

Accuracy score for Gradient boosting classifier with test data set is: 70.37 %
SUPERVISED LEARNING: CLASSIFICATION WITH
RANDOM FOREST (ENSEMBLE TECHNIQUES)

Accuracy score for Random Forest classifier with test data set is 77.78 %
UNSUPERVISED LEARNING: K MEANS CLUSTERING

Cluster Visualization
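
A minimal sketch of K-means clustering (Lloyd's algorithm) in plain Python. Seeding the centroids with the first k points is a simplification chosen to keep the example deterministic; it is not taken from the slides, and the toy points are made up.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids


# Two well-separated toy groups.
toy_points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
assignments, centroids = kmeans(toy_points, k=2)
```

Because no labels are used, the cluster indices are arbitrary; comparing them with the known tumor types is a separate step.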
FUTURE SCOPE / LIMIT

1. The supervised learning models can be iterated on to achieve higher accuracy.

2. The parameters of these classification techniques can be tuned to optimize performance.

3. Overall, because of the small number of samples, it was difficult to build models with higher accuracy.
As more data becomes available, these models can be retrained.

4. No single model suits every dataset. To choose a model, we always need to analyze the
dataset first and then apply the machine learning model.

5. The work can be used by future researchers and by hospitals.
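
The small-sample limitation can be made concrete with a little arithmetic on the accuracies reported in the earlier slides. The sketch below searches for the smallest test-set size n for which every reported accuracy is, to within 0.01 percentage points, a whole number of correct predictions out of n. The result is an inference from the reported figures, not a number stated in the slides, and `smallest_test_size` is a hypothetical helper.

```python
def smallest_test_size(accuracies_percent, tolerance=0.01, max_n=200):
    """Smallest n for which every accuracy is within `tolerance` of 100*k/n."""
    for n in range(1, max_n + 1):
        if all(
            abs(100 * round(acc * n / 100) / n - acc) <= tolerance
            for acc in accuracies_percent
        ):
            return n
    return None


# Distinct test accuracies reported in the slides, in percent.
reported = [77.78, 66.67, 48.14, 59.25, 70.37]
n = smallest_test_size(reported)
step = 100 / n  # change in accuracy caused by a single test sample
```

Under this reading, one additional correct or incorrect prediction moves the accuracy by several percentage points, so differences between models of that size should be interpreted with caution.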


SOFTWARE REQUIREMENTS

Jupyter Notebook
CONCLUSION

We have built our classification models, and the Random Forest, SVM and K Nearest Neighbour
algorithms give the best results for our dataset (77.78% test accuracy each). Several classification algorithms
were implemented on different attributes (primarily mass spectrometry analysis results for 12553 proteins)
to see whether a model can generate the correct cancer-type label.
THANK YOU!
