
MICROARRAYS FOR GENE EXPRESSION ANALYSIS

Rajeev K Sukumaran, PhD


Biotechnology Division,
Council of Scientific and Industrial Research-Regional Research Laboratory
Tiruvananthapuram 19
Gene Expression: what makes the cells different?
With very few exceptions, every cell of the body contains a full set of
chromosomes and identical genes. Only a fraction of these genes are turned
on at any given time.

It is the subset that is expressed which confers unique properties on each
cell type.

Gene expression is the term used to describe the transcription of the
information contained within the DNA into mRNA, which is then translated into
proteins, the effectors of cellular function.

The types and amounts of mRNA produced by a cell determine the way it
responds to changing needs.

Gene expression is a highly complex and tightly regulated process that
allows a cell to respond dynamically both to environmental stimuli and to its
own changing needs.
How is the molecular basis of complex biological processes studied?

[Figure: a cell in Condition A changes to Condition B through natural causes
or induced (chemical or physical) causes. Each condition has its own set of
mRNAs, proteins, and other components and organelles.]

A vs B comparison: qualitative and quantitative

Gene      A      B
gene p    1      2
gene q    22     321
gene r    12     9
gene s    58     41
gene t    102    19
gene u    0      56

Gene q and gene u are overexpressed in cell/condition B.
Gene t is underexpressed in condition B.

Product is a member of a known pathway, transcription activation system,
protein modification system, etc.: study the pathway and its other members
through overexpression and gene knockout studies, in vitro or in vivo.

Product previously unknown: characterize the gene, express the protein,
deduce its function, etc.
Gene Expression Analysis – The concepts

• Proteins are the effectors of the cellular machinery

• Cells are in a dynamic equilibrium, and the majority of proteins have a
rapid turnover rate maintained by active transcription and ubiquitin-mediated
protein degradation

• Gene expression is the synthesis of the proteins coded by specific genes in
a spatio-temporal context

How do we measure gene expression?

Quantification of specific transcripts (mRNAs): RT-PCR, qRT-PCR,
hybridization
Quantification of specific proteins: immunocytochemical methods
Transcriptome Analysis – The Rationale

The transcriptome of a cell is the set of genes expressed at any given time
(the profile of mRNAs).

The classical approach involves studying one gene, or a few genes, at a time.
This information is biased, since it is based on pre-defined knowledge.

Advantage: a more focused and deeper understanding of a gene's function in a
defined situation

Disadvantage: novel genes involved in the condition, and intermediate
players, are missed entirely
Large Scale Gene Expression Analysis – Transcriptome Profiling

Recent approaches to gene expression analysis involve studying the entire
set of expressed mRNAs, the transcriptome. The approach is far less biased
by prior knowledge, and the differences in expression levels between two
cells/tissues/conditions can give valuable clues to the molecular basis of
the situation studied.

How do we profile mRNAs?

• cDNA cloning and sequencing
• ESTs
• Microarrays
• Serial Analysis of Gene Expression (SAGE)
• Massively Parallel Signature Sequencing (MPSS)
Microarrays – the methodology for quantitative information
on expressed genes

There are approximately 500,000 mRNA molecules in a eukaryotic cell.

This figure represents 20,000 to 60,000 unique mRNA species, depending on
the cell type.

The absolute levels and types of unique mRNA species in a cell vary in
response to cell-signaling cues or perturbations in the extracellular
environment.
DNA microarray technology – How do we profile the transcriptome?

A microarray works by exploiting the ability of a given mRNA molecule to bind
specifically to, or hybridize to, the DNA template from which it originated.

By using an array containing many DNA samples, scientists can determine, in
a single experiment, the expression levels of hundreds or thousands of genes
within a cell by measuring the amount of mRNA bound to each site on the
array.

With the aid of a computer, the amount of mRNA bound to the spots on the
microarray is precisely measured, generating a profile of gene expression in
the cell.
The Basics – Nucleic Acid Hybridization

Microarrays are dot hybridization experiments on a larger scale!

DNA microarrays are small, solid supports onto which the sequences from
thousands of different genes are immobilized, or attached, at fixed
locations. The supports themselves are usually glass microscope slides, but
can also be silicon chips or nylon membranes. The DNA is printed, spotted,
or actually synthesized directly onto the support.

Probes for genes are spotted as arrays. The information on each spot (the
gene it represents, and annotations wherever available) is stored in a
database.
Types of Microarrays
cDNA microarrays: probe cDNA (500 to 5,000 bases long) is immobilized on a
solid surface such as glass by robotic spotting and exposed to a set of
targets, either separately or in a mixture.

Gene Chip: an array of oligonucleotides (20- to 80-mer oligos) is
synthesized either in situ (on-chip) or by conventional synthesis followed
by on-chip immobilization.
Microarray Experiment – Array hybridization, Probing

Extract total RNA from cells (optional: isolate mRNA)

Convert to cDNA in the presence of aminoallyl-dUTP

Label with Cy3 or Cy5 fluorescent dye, which couples to the incorporated
aminoallyl-dUTP

Hybridize the labeled cDNA to the array

Wash away the unhybridized material

Scan the slide using a laser confocal scanner
YOU GOT A PICTURE ... SO WHAT?

Where the experiment stops, informatics takes over


Steps in Microarray Data Analysis

Image Analysis
Normalization
t-Test, ANOVA
Principal Component Analysis
Clustering
Ontologies / Pathway Analysis / Function Prediction and other
knowledge-based analyses
Image Analysis

Analysis of the image of the scanned array seeks to extract an intensity for
each spot or feature on the array.
• Gridding: each spot is located by overlaying grids so that all spots fall
within defined grid cells
• Segmentation: algorithms analyze the pixel intensities to identify the best
shape for each spot and which pixels belong to it
• Intensity extraction: the intensity of each spot, and of its background, is
taken as the mean or median intensity of all pixels in that region
• Background correction: local or global background corrections are applied
to remove noise
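The intensity extraction and local background correction steps can be sketched in a few lines. This is only an illustration under simple assumptions: segmentation has already decided which pixels belong to the spot and which to its local background, the pixel values are invented, and the function name `spot_signal` is hypothetical rather than taken from any image analysis package.

```python
from statistics import median

def spot_signal(spot_pixels, background_pixels):
    """Return the background-corrected intensity of one spot.

    spot_pixels: intensities of the pixels assigned to the spot.
    background_pixels: intensities of the local background around it.
    """
    foreground = median(spot_pixels)
    background = median(background_pixels)
    # Clamp at zero so a very dim spot never gets a negative intensity.
    return max(foreground - background, 0)

# Invented pixel values for one spot in one channel.
spot = [812, 790, 805, 840, 798, 2100]   # includes one bright outlier pixel
local_bg = [95, 102, 99, 110, 97, 101]

corrected = spot_signal(spot, local_bg)
print(corrected)
```

The median is used here because a single bright pixel (dust, a scratch) shifts it far less than it would shift the mean.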
Free Software for Array Image Analysis

P-Scan: National Institutes of Health, http://abs.cit.nih.gov/psacan/index.html
ScanAlyze: Lawrence Berkeley National Lab, http://rana.lbl.gov/EisenSoftware.htm
Spotfinder: The Institute for Genomic Research, http://www.tigr.org/software/tm4/spotfinderfinder.html
Obtaining digital data for analyses
Image analysis software extracts spot intensity data, which can be
expressed as numeric values.
Controls are needed in each experiment to obtain the levels of expression in
the studied condition.
Comparisons are made after normalization of the data.
Normalization methods
Internal controls (one or more genes are assumed to be expressed at a
constant rate)
Sum of genes (the total signal over all genes is assumed to be constant)
Signal-dependent normalization (a subset of genes whose expression values do
not differ by more than a threshold is assumed to be constant)
Spike controls (a known amount of foreign RNA is added to the experiment and
serves as the control)
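As an illustration of the "sum of genes is constant" assumption, the sketch below rescales one array so that its total signal matches a reference array. The intensity values and the function names are made up for the example; real data would have thousands of genes, and other normalization methods would compute the scale factor differently.

```python
def total_intensity_scale(reference, sample):
    """Scale factor that makes the sample's total signal equal the reference's."""
    return sum(reference) / sum(sample)

def normalize(sample, factor):
    """Apply one global scale factor to every gene on the array."""
    return [value * factor for value in sample]

# Invented spot intensities for four genes on two arrays.
control = [100.0, 250.0, 50.0, 600.0]
treated = [220.0, 480.0, 90.0, 1210.0]   # scanned about twice as bright overall

factor = total_intensity_scale(control, treated)
treated_norm = normalize(treated, factor)
print(factor, treated_norm)
```

After scaling, a difference between the two arrays for a given gene reflects a change relative to the rest of the transcriptome rather than a difference in labeling or scanner settings.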
Comparison of Expression Levels
Fold change
Divide the expression level of a gene in the sample by the expression level of
the same gene in the control:
= 1: unchanged
< 1: downregulated
> 1: upregulated

Significance
An observed fold difference in expression may be due to experimental error.
Replicates of control and sample are run, and the data are tested
statistically (t-test or ANOVA) to see whether the differences are actually
significant.
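A small worked example shows why fold change alone is not enough. The replicate values below are invented, and Welch's form of the t statistic is used as one reasonable choice (the slides only say "t-test"). Turning the t statistic into a p-value needs the t distribution, which a statistics package would normally supply.

```python
from math import log2, sqrt
from statistics import mean, variance

def fold_change(sample, control):
    """Ratio of mean expression in the sample to mean expression in the control."""
    return mean(sample) / mean(control)

def welch_t(sample, control):
    """Welch's t statistic for two groups with possibly unequal variances."""
    standard_error = sqrt(variance(sample) / len(sample)
                          + variance(control) / len(control))
    return (mean(sample) - mean(control)) / standard_error

# Invented normalized expression values, three replicates per group.
genes = {
    "gene_x": {"control": [100.0, 110.0, 90.0], "sample": [400.0, 380.0, 420.0]},
    "gene_y": {"control": [100.0, 40.0, 160.0], "sample": [150.0, 30.0, 270.0]},
}

for name, values in genes.items():
    fc = fold_change(values["sample"], values["control"])
    t = welch_t(values["sample"], values["control"])
    print(f"{name}: fold change {fc:.2f}, log2 {log2(fc):+.2f}, t = {t:.2f}")
```

Here gene_x changes four-fold with tight replicates and gets a large t statistic, while gene_y shows a 1.5-fold change that is small compared with its replicate-to-replicate scatter.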
Sample data – Normalized Expression values of genes
Differences in gene expression patterns when hESCs are treated with
BMP4, a growth factor
The analyses to be performed on microarray data depend on the needs of the
user: what problem is he or she addressing?

In general, the analysis includes comparisons across several libraries to
find out what is different and what is similar.

The techniques used are mostly cluster analyses, including
• Hierarchical Cluster Analysis
• Self Organizing Maps
• Machine learning approaches
General Work Flow
[Figure: workflow diagram. Microarray data files -> normalization across
experiments -> preprocessing (log transformation, mean centering, median
centering) -> dimension reduction (expression criteria: fold difference,
significance; knowledge based, e.g. pathways) -> analysis (selection of
similarity measure, selection of clustering method, selection of clustering
rules) -> cluster display and processed data retrieval -> confirmation
experiments]
Dimension reduction
Microarray data are multidimensional:
each gene is compared across multiple libraries/experiments,
and each library/experiment is compared across multiple genes.

The data form a matrix with “m” columns and “n” rows, both of which can be
very large. Mathematical operations on such large matrices are
computationally intensive and can be demanding even for large computers.

As the matrix grows, memory requirements can grow faster than linearly: a
table of all pairwise gene-to-gene distances, for example, grows with the
square of the number of genes.
How to reduce the dimensions
One is interested only in the genes which vary from the general trend for the
set of compared libraries.
Principal component analysis (PCA) is a powerful method to reduce the
dimensions of the data. The covariance matrix of the data is decomposed by
standard matrix methods to yield eigenvalues and eigenvectors, which
summarize the structure of the data.
The data can then be described by <= n principal components, representative
of either genes or experiments. When plotted in 3D space along the first
three principal components, the data points show a pattern of direction and
orientation. The genes or experiments that deviate from the major trend are
the significantly variant ones.

Many programs allow graphical selection of principal components and their
component genes.

Important: PCA does not give much information on the relationship between
genes, but it does provide low-level information on covariance.
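To make the eigenvalue and eigenvector idea concrete, here is a minimal sketch that finds only the first principal component of a tiny gene-by-library table using power iteration. It is not how analysis packages implement PCA (they use full matrix decompositions), and the function names are illustrative; the five rows are the genes A to E from the example table in this talk.

```python
from math import sqrt

# Rows = genes A to E, columns = Library1 to Library3.
data = [
    [20.0, 10.0, 40.0],
    [23.0, 24.0, 25.0],
    [56.0, 28.0, 112.0],
    [2.0, 1.0, 4.0],
    [172.0, 0.0, 21.0],
]

def center_columns(rows):
    """Subtract each column's mean so the data are centered on the origin."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix between the columns."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]

def mat_vec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def first_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = mat_vec(cov, vec)
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(a * b for a, b in zip(vec, mat_vec(cov, vec)))
    return vec, eigenvalue

centered = center_columns(data)
cov = covariance(centered)
pc1, eigenvalue = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
# Score of each gene along the first principal component.
scores = [sum(a * b for a, b in zip(row, pc1)) for row in centered]
print(f"PC1 explains {explained:.0%} of the variance")
```

Genes with scores far from zero are the ones that deviate most from the average profile along the dominant direction of variation.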
Criteria based dimension reduction
Fold Difference
Tests of Significance

Tests of significance are done to assess the evidence provided by data in
favor of some claim (in our case, that gene expression is different between
two libraries).

Hypothesis testing
Null hypothesis: the statement being tested in a test of significance is
called the null hypothesis. In our case the null hypothesis is that there is
no difference between two libraries. If the observed data would be very
unlikely under this hypothesis, the libraries are taken to show a significant
difference in expression of the genes tested.
This probability is called the p-value, and it is derived from a test
statistic (for example a z or t statistic).
A p-value of 0.05 means that, if the null hypothesis were true, a difference
at least as large as the one observed would arise by chance only 5% of the
time. It is not the probability that the null hypothesis is correct.
With a cutoff of 0.05, about 5 out of every 100 genes that are truly
unchanged will still be called different by chance.
With a cutoff of 0.01, about 1 out of every 100.
With a cutoff of 0.001, about 0.1 out of every 100 (1 in 1,000).

Knowledge based dimension reduction
Based on functional classifications
Gene Ontologies
(Useful when one is interested in only a select set of genes, say a
particular pathway)
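To see what the p-value cutoffs mean at the scale of a whole array, the arithmetic can be run for the 35000 genes mentioned in this talk. The calculation assumes the worst case in which no gene truly differs, so every gene passing the cutoff is a chance finding; the function name is illustrative.

```python
def expected_false_positives(genes_tested, p_cutoff):
    """Genes expected to pass the cutoff by chance if none truly differ."""
    return genes_tested * p_cutoff

for cutoff in (0.05, 0.01, 0.001):
    count = round(expected_false_positives(35000, cutoff))
    print(f"p < {cutoff}: about {count} genes expected by chance alone")
```

This is why a stricter cutoff, or a fold-difference criterion applied alongside the significance test, is commonly used when thousands of genes are tested at once.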
OK, we brought down the number of genes from 35000 to 2000! What next?

Which genes are overexpressed under a given situation?
Which genes are turned off under a given condition?
What is the pattern of gene expression?
Are they co-regulated?
Are the co-regulated genes members of some pathway?

HOW DO WE ANSWER THESE QUESTIONS?

Comparing gene expression across libraries (arrays) gives information on
overexpressed genes and on gene expression patterns:

Pairwise comparisons for overexpression analysis
Cluster analyses to obtain expression patterns
Software for Gene Expression analyses

Both commercial and free-license software are available for gene expression
data analyses; in general they offer HCA, SOM, and some preprocessing and
statistical analysis functions.

http://genome.tugraz.at
http://www.silicongenetics.com

Steps in processing include getting the data into the required format
(Stanford file format), normalizing the number of tags, and log
transformation of the data.
Mean or median centering is done in most cases.
The desired distance measure is selected (usually Pearson correlation).
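The log transformation and median centering steps can be sketched for a single gene as follows. The base-2 logarithm and the optional pseudocount are assumptions made for the example (the slides do not state a base, and a pseudocount is one common way to handle zero values); the function names are illustrative.

```python
from math import log2
from statistics import median

def log_transform(values, pseudocount=1.0):
    """log2 transform; the pseudocount keeps zero values finite."""
    return [log2(v + pseudocount) for v in values]

def median_center(values):
    """Shift a row so that its median becomes zero."""
    mid = median(values)
    return [v - mid for v in values]

raw = [20.0, 10.0, 40.0]                        # one gene across three libraries
logged = log_transform(raw, pseudocount=0.0)    # no zeros here, so none is needed
centered = median_center(logged)
print(centered)
```

After these two steps a halving and a doubling relative to the middle value sit at equal distances on either side of zero, which is what makes log-transformed, centered data convenient for distance calculations.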
Data Processing
Preparing the Data
Generally, clustering software allows three columns and two rows of
information before the data matrix. The first row holds the column
descriptions, and the first column holds the unique ID (tag sequence). The
second column can contain a user-defined entry (gene name, symbol, UniGene ID,
etc.). The third column contains an optional entry, the weight assigned to
each row; the second row likewise can hold an optional entry, the weight
assigned to each column.
This format is called the Stanford file format.
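The layout described above can be written out as a tab-delimited table. In this sketch the header labels `UNIQID`, `NAME`, `GWEIGHT` and `EWEIGHT` are assumed from the convention used by common clustering tools and should be checked against the documentation of the software in use; the gene IDs are invented and the function name is hypothetical.

```python
import csv
import io

def write_stanford_table(out, samples, genes):
    """Write a tab-delimited table with 3 leading columns and 2 leading rows."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Row 1: column descriptions (header labels are assumed names).
    writer.writerow(["UNIQID", "NAME", "GWEIGHT"] + samples)
    # Row 2: optional weight for each data column (1 = equal weighting).
    writer.writerow(["EWEIGHT", "", ""] + [1] * len(samples))
    # Data rows: unique ID, user-defined name, row weight, then the values.
    for unique_id, name, weight, values in genes:
        writer.writerow([unique_id, name, weight] + values)

samples = ["Library1", "Library2", "Library3"]
genes = [
    ("tag_0001", "Gene A", 1, [20, 10, 40]),
    ("tag_0002", "Gene B", 1, [23, 24, 25]),
]

buffer = io.StringIO()
write_stanford_table(buffer, samples, genes)
table_text = buffer.getvalue()
print(table_text)
```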
PAIRWISE COMPARISONS

Microarray libraries can be compared one to one to identify genes
overexpressed/upregulated in one library versus the other.
Genes upregulated in a library can be selected from interactive scatter
plots.

Three dimensional comparisons

Three libraries can be compared using 3D scatter plots.

[Figure: 3D scatter plot with axes Ovary – Normal Epithelium, Ovary – Tumor,
and Ovary – Tumor3]

Selected genes can be further analyzed by experimental methods.
Multi-dimensional comparisons- Cluster Analyses

Cluster analyses group genes/experiments together based on their level of
similarity.
How does the computer know genes/samples are similar?

GENE   Library1   Library2   Library3
A      20         10         40
B      23         24         25
C      56         28         112
D      2          1          4
E      172        0          21

[Figure: line graph of the expression values of Genes A to E across the
three libraries]

Genes A, C and D are similar (their profiles rise and fall in the same
proportions), and this similarity can be defined mathematically as the
“correlation coefficient”.
A measure of correlation between any two genes can be converted into a
distance between them. The Pearson correlation ranges from -1 to 1, and a
higher value means higher similarity; a common choice of distance is 1 minus
the correlation, so that similar genes lie close together.
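The similarity of genes A, C and D can be checked directly with the Pearson correlation coefficient. The sketch below uses the values from the table; the conversion to a distance as 1 minus the correlation is one common convention, assumed here for illustration.

```python
from math import sqrt

profiles = {
    "A": [20, 10, 40],
    "B": [23, 24, 25],
    "C": [56, 28, 112],
    "D": [2, 1, 4],
    "E": [172, 0, 21],
}

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def distance(x, y):
    """Correlation distance: 0 for identical shapes, up to 2 for opposite ones."""
    return 1 - pearson(x, y)

for other in "BCDE":
    r = pearson(profiles["A"], profiles[other])
    print(f"A vs {other}: r = {r:.3f}, distance = {1 - r:.3f}")
```

Genes C and D are exact multiples of gene A (2.8 times and 0.1 times), so their correlation with A is 1 even though their absolute expression levels differ widely. Correlation compares the shape of the profile, not its height.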
Common distance measures used in gene expression analyses
Standard correlation, Pearson correlation, Manhattan distance, Cosine
distance, Poisson distance
Distance measures are used to find the similarity between genes and between
experiments.
Genes close to one another are grouped together and displayed as cluster
diagrams, which can be colored blocks, line graphs or scatter plots.

Methodologies for cluster analysis differ in the algorithms used to
implement them and can be based on statistical methods (Hierarchical Cluster
Analysis, K-means clustering) or on machine learning approaches (Self
Organizing Maps, Support Vector Machines, neural network methods).

Methodologies following matrix solution methods: Principal Component
Analysis
Hierarchical Cluster Analysis

Hierarchical cluster analysis is a statistical method for finding relatively
homogeneous clusters of observations based on measured characteristics.

HCA produces trees of relationships; in our case, gene trees and sample
trees.

[Figure: sample tree and gene tree]
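How a gene tree is built can be sketched with a bare-bones agglomerative clustering of the five example genes A to E. Average linkage and a distance of 1 minus the Pearson correlation are assumptions made for the illustration; clustering packages offer several linkage rules and distance measures, and the function names below are not from any particular program.

```python
from math import sqrt

profiles = {
    "A": [20, 10, 40],
    "B": [23, 24, 25],
    "C": [56, 28, 112],
    "D": [2, 1, 4],
    "E": [172, 0, 21],
}

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def cluster_distance(members_a, members_b):
    """Average linkage: mean of (1 - r) over all cross-cluster gene pairs."""
    pairs = [(i, j) for i in members_a for j in members_b]
    return sum(1 - pearson(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

def hierarchical_clustering():
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = {name: [name] for name in profiles}
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        candidates = [(cluster_distance(clusters[a], clusters[b]), a, b)
                      for k, a in enumerate(names) for b in names[k + 1:]]
        dist, a, b = min(candidates)
        merges.append((a, b, dist))
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges

merges = hierarchical_clustering()
for a, b, dist in merges:
    print(f"merge {a} + {b} at distance {dist:.3f}")
```

The order and height of the merges is exactly what a tree diagram draws: genes A, C and D join first at a distance of essentially zero, gene B joins that group next, and gene E joins last.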


Example – HCA on Ovarian Cancer Libraries

Clustering of samples: ovarian cancers are closely related to gastric and
pancreatic cancers.

Ovarian cancer specific gene clusters group known epithelial ovarian cancer
genes:
Mesothelin
Mucin1
Ceruloplasmin

Genes sharing clusters with known cancer genes may be potential candidates
for study:
CXCL1 – Chemokine (C-X-C motif) ligand 1 (melanoma growth stimulating
activity, alpha)
IL4I1 – Interleukin 4 induced protein 1
EZI – Zinc finger protein EZI
Functions of known genes co-expressed with cancer genes can be retrieved
from ontology databases (AmiGO) or using locally installed software:
EASE – http://david.niaid.nih.gov/david/ease.htm
Self Organizing Map (SOM) Analyses

A Self-Organizing Map (SOM) is a neural-network array whose cells (or nodes)
become specifically tuned to recognize signal patterns or classes of
patterns in an orderly fashion.

The learning process is competitive and unsupervised, and the locations of
the responses in the array tend to become ordered during learning, thus
producing meaningful groupings of the data based on their similarities.

The data format for feeding into the software is similar to that for HCA,
but the user needs to specify the number of desired groupings, the number of
iterations to be performed on each gene before the result is output, and the
neighborhood radius.
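The three user-supplied settings (number of groupings, number of iterations, neighborhood radius) map directly onto the parameters of a toy one-dimensional SOM. This is a simplified sketch, not the algorithm of any particular package: the learning rate, its linear decay, the Gaussian neighborhood function and the four invented expression profiles are all assumptions made for the illustration.

```python
import random
from math import exp

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(rows, n_nodes, iterations, radius, rate=0.5, seed=0):
    """Train a one-dimensional SOM and return the node weight vectors."""
    rng = random.Random(seed)
    # Start each node on a different, randomly chosen data row.
    nodes = [list(row) for row in rng.sample(rows, n_nodes)]
    for step in range(iterations):
        decay = 1 - step / iterations        # shrinks from 1 toward 0
        sigma = radius * decay + 1e-9        # current neighborhood radius
        row = rng.choice(rows)
        # Competitive step: the node closest to the input wins.
        winner = min(range(n_nodes), key=lambda k: sq_dist(nodes[k], row))
        for k in range(n_nodes):
            # The winner moves most; its neighbors on the grid move less.
            influence = exp(-((k - winner) ** 2) / (2 * sigma ** 2))
            step_size = rate * decay * influence
            nodes[k] = [w + step_size * (x - w) for w, x in zip(nodes[k], row)]
    return nodes

def assign(rows, nodes):
    """Index of the closest node for every row."""
    return [min(range(len(nodes)), key=lambda k: sq_dist(nodes[k], row))
            for row in rows]

# Invented centered expression profiles: two rising, two falling.
rows = [[-1.0, 0.0, 1.0], [-0.9, 0.1, 0.9], [1.0, 0.0, -1.0], [0.9, -0.1, -0.9]]
nodes = train_som(rows, n_nodes=2, iterations=500, radius=1.0)
groups = assign(rows, nodes)
print(groups)
```

Each node ends up as a prototype expression pattern, and the genes assigned to it form one SOM cluster.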
SOM analysis of Ovarian Cancer data

SOM identifies patterns of gene expression!

[Figure: SOM cluster showing genes upregulated in ovarian cancers (includes
MUC1)]

Genes in the SOM clusters specific to the case under study can be retrieved
and studied further experimentally.
DEMONSTRATION OF ANALYSIS
GENESIS SOFTWARE
[Figure: bar chart, "Comparison of the expression of genes belonging to
select categories". Y axis: % Expression (0 to 18). X axis: Functional
Categories (Apoptosis, Cell Cycle, Cell Cycle Checkpoint, Cell proliferation,
Differentiation, DNA modification, DNA repair, DNA replication, ECM, FGF
pathway, Growth Factors/Ligand, Kinases, Receptors, Signal Transduction,
Transcription Factors/Co-factors, Wnt pathway, TGF beta signaling). Series:
Differentiated, Undifferentiated]
Demonstration of EASE software
Gene Expression Data Can Be Catalogued

Case study with human embryonic stem cell pluripotency

Further Analyses of Identified Genes important in a particular case/sample

Informatics ends here, and experimental methodologies take over for final
confirmation and analyses:

PCR for expression analyses

Knockouts and/or knock-ins for functional analyses

Traditional biology: biochemical, immunological, molecular biology, or
physiology methods
Microarrays have made the study of cellular processes more comprehensive and
less biased!

We now have more information from fewer experiments, and the main
limitations are in the analysis of the data and its interpretation.

Bioinformatics can help scientists understand life better by providing
knowledge-based solutions.
THANK YOU
