Mathematical Model of Classification of Human Genome Data For Breast Cancer

Mathematical Model of Classification of Human Genomic Data for Breast Cancer
Group: 7
Presented By: Sukanya Samanta, Suraj Pandey
Under the guidance of: Prof. Gitosree Khan
OBJECTIVE
To analyse gene alterations in order to stratify individuals who are predisposed to
a higher risk of cancer, and somatic mutations in order to profile tumor
characteristics for precise therapy selection.
To inform, educate and support cancer treatment and research.

INTRODUCTION
● Breast cancer is a multifactorial disease that forms in the cells of the breast. It can occur in both men and
women, but it is far more common in women.
● Many factors increase one's chances of developing breast cancer; mutations in DNA are often the root
cause. Other contributors to susceptibility are mutations to genes that assist in cell differentiation
(proto-oncogenes) and mutations to genes that modulate cell division, cellular repair and apoptosis
(tumor suppressor genes).
● Breast cancer awareness and research funding have helped create advances in the diagnosis and treatment of breast
cancer. Medical advances and innovation in treating breast cancer have increased survival rates, lowered recurrence
rates and reduced the number of deaths associated with the disease.
● More recently, the introduction of precision medicine and gene therapy has shown the potential to transform how
physicians treat breast cancer. Precision medicine refers to the tailoring of medical treatment based on the cellular
profile of a disease and the patient's genome.
PROPOSED WORK: Work Flow
1. Collecting datasets
2. Converting unstructured data to structured data
3. Integrity check
4. Removing extra data
5. Selecting the best model by applying machine learning algorithms
6. Getting the best result for our dataset
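The integrity check and extra-data removal steps of the workflow can be sketched with pandas. This is a minimal illustration on a made-up table, not the project's actual code: the column names, the values and the 50% missing-value threshold are all assumptions chosen for the example.

```python
import pandas as pd

# Made-up stand-in for a raw table (hypothetical names and values).
raw = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s2", "s3"],
        "protein_a": [0.2, None, None, 1.4],
        "protein_b": [None, None, None, 0.7],
        "notes": ["x", "y", "y", "z"],
    }
)

# Integrity check: count fully duplicated rows and the share of missing
# values per column before changing anything.
n_dupes = int(raw.duplicated().sum())
missing_share = raw.isna().mean()

# Removing extra data: drop duplicate rows, drop a column that is not
# needed for modelling, then drop columns that are more than half empty
# (the 0.5 threshold is an assumption).
clean = (
    raw.drop_duplicates()
    .drop(columns=["notes"])
    .loc[:, lambda d: d.isna().mean() <= 0.5]
)

print(n_dupes)
print(list(clean.columns))
```

Running the checks before the clean-up keeps a record of how much data was removed and why.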
EXTENDED WORK: DATA SETS
File: 77_cancer_proteomes_CPTAC_itraq.csv
Includes 12553 unique genes from a total of 89 breast cancer patients.

File: clinical_data_breast_cancer.csv
Contains clinical information from 105 breast cancer patients.
The first column, "Complete TCGA ID", is used to match the sample IDs in the main cancer proteomes file.
All other columns have self-explanatory names and contain data about the cancer classification of a given sample
using different methods.

File: PAM50_proteins.csv
Contains the list of genes and proteins used by the PAM50 classification system. The column RefSeqProteinID contains the
protein IDs that can be matched with the IDs in the main protein expression data set.
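The matching described above can be sketched as two joins. The frames below are tiny made-up stand-ins for the three files. Only "Complete TCGA ID" and RefSeqProteinID come from the slides; the protein ID column name `RefSeq_accession_number`, the label column `PAM50 mRNA`, the ID values, and the assumption that proteome sample columns already use the same ID format as the clinical file are all illustrative.

```python
import pandas as pd

# Made-up stand-ins for the three CSV files (not real data).
proteomes = pd.DataFrame(
    {
        "RefSeq_accession_number": ["NP_000001", "NP_000002", "NP_000003"],
        "TCGA-AA-0001": [0.5, -1.2, 0.3],
        "TCGA-AA-0002": [1.1, 0.4, None],
    }
)
clinical = pd.DataFrame(
    {
        "Complete TCGA ID": ["TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"],
        "PAM50 mRNA": ["Basal-like", "Luminal A", "Luminal B"],
    }
)
pam50 = pd.DataFrame({"RefSeqProteinID": ["NP_000001", "NP_000003"]})

# Keep only proteins on the PAM50 list, then transpose so that each row
# is a sample and each column is a protein.
is_pam50 = proteomes["RefSeq_accession_number"].isin(pam50["RefSeqProteinID"])
expr = proteomes[is_pam50].set_index("RefSeq_accession_number").T
expr.index.name = "Complete TCGA ID"

# Inner join: only samples present in both files are kept.
merged = expr.reset_index().merge(clinical, on="Complete TCGA ID", how="inner")
print(merged)
```

An inner join silently drops unmatched samples, so it is worth comparing the row count of `merged` against both source files.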
DISCUSSION
Proportion of Different Tumor Types Using a Pie Chart
BAR PLOT of OS Time
LINE CHART
Data Transformation (PCA Implementation)
Dataframe after converting into principal components
PCA Transformation
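A PCA transformation of this kind might look like the sketch below, written with scikit-learn on synthetic data. The median imputation, the standardisation step and the choice of 10 components are assumptions for illustration; the slides do not state which settings were used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 80 samples x 50 protein measurements, with some
# values missing, as is common in expression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 50))
X[rng.random(X.shape) < 0.05] = np.nan

# PCA cannot handle missing values and is sensitive to scale, so fill the
# gaps and standardise each column first (both choices are assumptions).
X_filled = SimpleImputer(strategy="median").fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_filled)

# Project onto the first 10 principal components (hypothetical number).
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_scaled)

# Cumulative share of variance explained by the retained components.
explained = pca.explained_variance_ratio_.cumsum()
print(X_pca.shape, round(float(explained[-1]), 3))
```

The cumulative explained variance is a common guide for deciding how many components to keep.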
SUPERVISED LEARNING: CLASSIFICATION WITH SUPPORT VECTOR MACHINES
SVM Working Principle

Results of classification carried out using support vector machines
are summarized in a confusion matrix.
Accuracy for test data is equal to 77.78%.
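The sketch below shows how a confusion matrix and a test accuracy are typically obtained for an SVM with scikit-learn. It uses synthetic four-class data, so it does not reproduce the 77.78% figure; the RBF kernel, `C=1.0` and the 25% test split are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the reduced feature matrix and tumor-type labels.
X, y = make_classification(
    n_samples=120, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale inside the pipeline so the test set never influences the scaler.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

# Accuracy is the diagonal of the confusion matrix divided by its total.
print(cm)
print(acc, cm.trace() / cm.sum())
```

Reading accuracy off the diagonal makes it easy to see which tumor types are confused with each other, which a single percentage hides.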
SUPERVISED LEARNING: CLASSIFICATION WITH MULTINOMIAL LOGISTIC REGRESSION
The accuracy obtained is 66.67%.

SUPERVISED LEARNING: CLASSIFICATION WITH K NEAREST NEIGHBORS
With the KNN classifier, the accuracy obtained is 77.78%.
DECISION TREES
For gini
Accuracy score for Decision tree classifier (gini) with test data set is: 48.14%
For entropy
Accuracy score for Decision tree classifier (entropy) with test data set is: 59.25%
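The two criteria differ only in how they score the impurity of a node. The sketch below computes both measures for a list of class labels, using the standard definitions (Gini: one minus the sum of squared class shares; entropy: minus the sum of share times log2 share). The label values are made up; in scikit-learn the same choice is made through the `criterion` argument of `DecisionTreeClassifier`.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum of p * log2(p) over classes."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


# A node holding two classes in equal numbers is maximally impure for
# two classes; a node with a single class has zero impurity.
mixed = ["Luminal A"] * 4 + ["Basal-like"] * 4
pure = ["Luminal A"] * 8

print(gini(mixed), entropy(mixed))  # 0.5 1.0
print(gini(pure), abs(entropy(pure)))  # 0.0 0.0
```

Both measures rank splits similarly in most cases, so a gap between the two accuracies on a small test set may reflect sampling noise as much as the criterion itself.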
GRADIENT BOOSTING (ENSEMBLE TECHNIQUES)
Accuracy score for Gradient boosting classifier with test data set is: 70.37%
RANDOM FOREST (ENSEMBLE TECHNIQUES)
Accuracy score for Random Forest classifier with test data set is: 77.78%
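The two ensemble techniques can be compared side by side as in the sketch below. It uses synthetic data and 5-fold cross-validation rather than a single test split, which is an assumption made here because a single small test set gives noisy accuracy estimates; the numbers of estimators are also illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and tumor-type labels.
X, y = make_classification(
    n_samples=100, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Gradient boosting builds trees sequentially, each correcting the last;
# a random forest averages many independently grown trees.
models = {
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```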
UNSUPERVISED LEARNING: K MEANS CLUSTERING
Cluster Visualization
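A K-means clustering step could be sketched as follows with scikit-learn. The data are synthetic two-dimensional blobs standing in for the first two principal components, and the choice of three clusters is an assumption; the slides do not state the number of clusters used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for samples in principal-component space.
X, _ = make_blobs(n_samples=90, centers=3, n_features=2, random_state=0)

# Fit K-means; n_init=10 restarts the algorithm from 10 random
# initialisations and keeps the best run.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Cluster sizes and centroids are the usual inputs to a cluster plot:
# scatter X colored by `labels` and mark `km.cluster_centers_`.
sizes = np.bincount(labels, minlength=3)
print(sizes)
print(km.cluster_centers_)
```

Because K-means ignores the class labels, comparing its clusters with the known tumor types shows whether the subtypes separate naturally in the data.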
FUTURE SCOPE / LIMITATIONS
1. The supervised learning models can be iterated on to achieve higher accuracy.
2. The parameters of these classification techniques can be tuned to optimize performance.
3. Overall, the small number of samples made it difficult to build models with higher accuracy.
As more data comes in, these models can be retrained.
4. No single model suits every dataset. To choose a model, we always need to analyze the
dataset first and then apply a suitable machine learning model.
5. The models can be used by future researchers and by hospitals.
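The parameter tuning mentioned in the list above is commonly done with a cross-validated grid search. The sketch below tunes a K nearest neighbors classifier on synthetic data; the parameter grid, the 5-fold setup and the choice of KNN as the example model are assumptions, not settings taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the feature matrix and tumor-type labels.
X, y = make_classification(
    n_samples=100, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Candidate settings to try (hypothetical grid).
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

# Stratified folds keep the class proportions similar in every fold,
# which matters when there are few samples per class.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring="accuracy")
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

With small sample sizes the best cross-validated score is itself an optimistic estimate, so a held-out test set is still needed for the final figure.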

SOFTWARE REQUIREMENTS
Jupyter Notebook
CONCLUSION
We built several classification models and found that the Random Forest, SVM and K Nearest Neighbors
algorithms give the best results for our dataset, each reaching 77.78% accuracy on the test data. Based on
different attributes (primarily mass spectrometry analysis results for 12553 proteins), several classification
algorithms were implemented to see whether a model can predict the correct cancer type label.
THANK YOU!
