Bioinformatics 1 - Microarrays

In this session I am going to introduce gene expression microarrays, which are used to characterise the RNA composition of a sample across thousands of genes simultaneously. This allows biologists to use discovery or data-driven approaches to their work. Typical examples of microarray use include comparing gene expression between wild-type and mutant samples, temporal profiling, assaying drug effects and identifying bio-markers for disease.

Microarray analysis falls into a category of high-throughput technologies that generate large multi-variate data sets for which principled methods need to be employed to accurately quantify changes. This begins with quality control and normalisation processes to allow meaningful cross-comparison, properly modelled statistical testing to identify changes and ends with an assimilation of the primary results into further downstream analyses to interpret the changes in a biologically meaningful way.


  1. Lecture slides can be found here
  2. R/Bioconductor - statistical computing for the biological sciences
  3. Importing microarray data
  4. Quality control, pre-processing and normalisation
  5. Principal component analysis (PCA)
  6. Differential expression calling
  7. Functional enrichment analysis
  8. Clustering


Log in to your DICE account

Please make a directory called bio1_microarray in your home directory and then change into it.

mkdir bio1_microarray
cd bio1_microarray

now download the following files into the directory using your web browser

next uncompress the archived data using the following command

tar -zxf *

next invoke R at the command line


Then execute the following lines to install the necessary libraries for the session :-


This will take a while as the components are compiled into the R package library.

For future reference a guide to installing R/Bioconductor can be found here

Part I - loading the microarray data

#load libraries

#set working directory

#load Affymetrix CEL files
data <- ReadAffy(); #loading CEL files for wild-type and mutant samples GFP +ve and -ve (16 chips)

#find out a bit about the object

#what's an AffyBatch

#some access methods
probes(data,which='pm'); #probe perfect match expression
boxplot(data); #box-whisker plots each chip
image(data[,1]); #see a chip image plot

#examine the slots
data@cdfName; #get the chip platform name
data@phenoData; #get the experimental data

#what's an AnnotatedDataFrame ?

#lets annotate on the experimental information
pData(data)$condition <- as.factor(c(rep('wt',8),rep('mut',8)));
pData(data)$gfp <- as.factor(rep(c('p','n'),8));
pData(data)$sample <- as.factor(sort(rep(seq(1,8),2)));

#view the updated phenotype data

Part II - Checking quality, pre-processing and normalisation

#arrayQualityMetrics(data,outdir='./AQM_pre/'); #DO NOT RUN THIS COMMAND

arrayQualityMetrics runs a whole series of tests on the chips, but takes a very long time to compute. I have pre-computed these pre- and post- normalisation and we are going to look through these. In order to do this use your web browser to open the directories AQM_pre and AQM_post from your DICE home (these were two of the files you uncompressed earlier). If you have a problem you can also access them here :- AQM_pre and AQM_post

#now lets normalise the data
norm <- rma(data); # RMA - the Robust Multi-Chip Average - returns expression in log2

#and extract the expression matrix
exprs <- exprs(norm);

Part III - Principal Component Analysis

pca <- prcomp(t(exprs)); # Note, we are doing PCA by chip i.e. trying to see where the variation is between the chips

#look at where the variation is across the Principal Components

#extract the principal components
pcs <- data.frame(pca$x);

#add in the annotation data by merging (just for the first 2 principal components)
pcs <- merge(pcs[,1:2],pData(data),by=0);

#plot the first two principal components

#plot by condition
#plot by gfp status
#plot by sample

Part IV - Differential Expression Calling

#load limma a package for calling differntially expressed genes
#load the Bioconductor annotation file for the Drosophila_2 chips

#convert the expresison matrix from log2 scale to base 10
exprs <- data.frame(2^exprs); # we're just extracting the expression slot from the ExpressionSet object, convert to base10

#add group to the annotation data
d <- data.frame(pData(data));

#rename columns to make easier to identify
colnames(exprs) <- d$group[match(row.names(d),colnames(exprs))];

#these variables can be exported to the environment so that they can be referred to directly

#create an enrichment matrix by dividing GFP+ signal by GFP- signal
enrich <- data.frame(s1=p1/n1,s2=p2/n2,s3=p3/n3,s4=p4/n4,s5=p5/n5,s6=p6/n6,s7=p7/n7,s8=p8/n8,row.names=row.names(exprs));

#set up the 'levels' matrix (this shows the groupings for contrasts)
enrich_design <- table(data.frame(row.names=colnames(enrich),unique(d[,1:2])));
#look at it

#set up the contrast matrix (what we are comparing) - note 'wt-mut'
cont.matrix <- makeContrasts(wt-mut,levels=enrich_design);
#look at it

#fit the linear model
fit <- lmFit(enrich, enrich_design);

#calculate the fit coefficients and error for the comparison(s)
fit2 <-, cont.matrix);

#calculate t-statistics and log-odds of differential expression using emperical Bayes approach
fit2 <- eBayes(fit2);

#have a quick look at the result

#get the differentially expressed genes fdr=1%
diffexp <- topTable(fit2,adjust="fdr",'logFC',n=1000000000,p=0.01);

#reformat to make the results simpler to view
diffexp_top <- data.frame(diffexp[,c(1,2)],fdr=diffexp[,6],fc=diffexp[,2]);
#bring in the gene names
diffexp_final <- merge(diffexp_top,toTable(drosophila2SYMBOL[diffexp_top$ID]),by.x='ID',by.y='probe_id',sort=F,all.x=T);
#look at the top10

Part V - Functional Enrichment Analysis [optional]

#Installing topGO

#Using  package to look for functional enrichment in gene lists

geneList <- fit2$p.value[,1];
names(geneList) <- as.factor(fit2$genes[,1]);

topDiffGenes <- function(x){

#now set up the Drosophila GO

#create the GOData object that contains all of the gene association information for background and your own gene list
sampleGOdata <- new("topGOdata",description = "Simple session", ontology = "BP",allGenes = geneList, geneSel = topDiffGenes,nodeSize = 10,  annot = annFUN.db, affyLib = "drosophila2.db")

#perform a Fisher test to look for statistically significant enrichment of terms
resultFisher <- runTest(sampleGOdata, algorithm = "classic", statistic = "fisher")

#use a Kolmogorov–Smirnov test instead
resultKS <- runTest(sampleGOdata, algorithm = "classic", statistic = "ks")

#apply KS test, with more conservative outcome 'elim'
resultKS.elim <- runTest(sampleGOdata, algorithm = "elim", statistic = "ks")

#display all the tests together
allRes <- GenTable(sampleGOdata, classicFisher = resultFisher, classicKS = resultKS, elimKS = resultKS.elim, orderBy = "elimKS", ranksOf = "classicFisher", topNodes = 30)

Part VI - Clustering [optional]


#check to see that gthe chips segregate by genotype
cl <- kmeans(t(enrich),2);

#cluster the top 100 differentially expressed genes
cl_group <- diffexp_final[1:100,]

#get the ids
ids <- cl_group$ID;

#extract only those expression values
cl_enrich <- enrich[ids,];

#use hierarchical clustering to partition the differentially expressed genes
hc <- hclust(dist(cl_enrich),method='complete');

#show the expression groupings in a heatmap

Final session script

When we've completed the session you can download the complete session R script here

Sources of further information

  1. Microarrays
    • An excellent presentation by the functional genomics group at the European Bioinformatics Institute (EBI) pdf
    • NCBI-GEO
    • An example GEO entry
  2. R
  3. Bioconductor
  4. Statistics
  5. Tutorials
    • Simple explanation of the RMA method from PlexDB
    • Excellent tutorial for PCA from Princeton
    • As it's optional we may not have time to do the functional annotation work, if interested you can find the topGO manual here