Bioinformatics 1 - Microarrays

In this session I am going to introduce gene expression microarrays, which are used to characterise the RNA composition of a sample across thousands of genes simultaneously. This allows biologists to use discovery or data-driven approaches to their work. Typical examples of microarray use include comparing gene expression between wild-type and mutant samples, temporal profiling, assaying drug effects and identifying bio-markers for disease.

Microarray analysis falls into a category of high-throughput technologies that generate large multi-variate data sets for which principled methods need to be employed to accurately quantify changes. This begins with quality control and normalisation processes to allow meaningful cross-comparison, properly modelled statistical testing to identify changes and ends with an assimilation of the primary results into further downstream analyses to interpret the changes in a biologically meaningful way.

Overview

  1. Lecture slides can be found here
  2. R/Bioconductor - statistical computing for the biological sciences
  3. Importing microarray data
  4. Quality control, pre-processing and normalisation
  5. Principal component analysis (PCA)
  6. Differential expression calling
  7. Functional enrichment analysis
  8. Clustering

R/Bioconductor

Log in to your DICE account

Please make a directory called bio1_microarray in your home directory and then change into it.

mkdir bio1_microarray
cd bio1_microarray

now download the following files into the directory using your web browser

next uncompress the archived data using the following command

tar -zxf *

next invoke R at the command line

R

Then execute the following lines to install the necessary libraries for the session :-

install.packages('lattice');
source("http://bioconductor.org/biocLite.R");
biocLite("affy");
biocLite("limma");
biocLite("drosophila2.db");

This will take a while as the components are compiled into the R package library.

For future reference a guide to installing R/Bioconductor can be found here http://www.bioconductor.org/install/

Part I - loading the microarray data

#load libraries
library(affy);

#set working directory
setwd('./');

#load Affymetrix CEL files
data <- ReadAffy(); #loading CEL files for wild-type and mutant samples GFP +ve and -ve (16 chips)

#find out a bit about the object
class(data);

#what's an AffyBatch
?AffyBatch

#some access methods
probes(data,which='pm'); #probe perfect match expression
boxplot(data); #box-whisker plots each chip
image(data[,1]); #see a chip image plot

#examine the slots
data@cdfName; #get the chip platform name
data@phenoData; #get the experimental data

#what's an AnnotatedDataFrame ?
?AnnotatedDataFrame;

#lets annotate on the experimental information
pData(data)$condition <- as.factor(c(rep('wt',8),rep('mut',8)));
pData(data)$gfp <- as.factor(rep(c('p','n'),8));
pData(data)$sample <- as.factor(sort(rep(seq(1,8),2)));

#view the updated phenotype data
pData(data);

Part II - Checking quality, pre-processing and normalisation

#library(arrayQualityMetrics);
#arrayQualityMetrics(data,outdir='./AQM_pre/'); #DO NOT RUN THIS COMMAND

arrayQualityMetrics runs a whole series of tests on the chips, but takes a very long time to compute. I have pre-computed these pre- and post- normalisation and we are going to look through these. In order to do this use your web browser to open the directories AQM_pre and AQM_post from your DICE home (these were two of the files you uncompressed earlier). If you have a problem you can also access them here :- AQM_pre and AQM_post

#now lets normalise the data
norm <- rma(data); # RMA - the Robust Multi-Chip Average - returns expression in log2

#and extract the expression matrix
exprs <- exprs(norm);

Part III - Principal Component Analysis

pca <- prcomp(t(exprs)); # Note, we are doing PCA by chip i.e. trying to see where the variation is between the chips

#look at where the variation is across the Principal Components
summary(pca);

#extract the principal components
pcs <- data.frame(pca$x);

#add in the annotation data by merging (just for the first 2 principal components)
pcs <- merge(pcs[,1:2],pData(data),by=0);

#plot the first two principal components
library(lattice);

#plot by condition
xyplot(PC2~PC1,pcs,group=condition,auto.key=T);
#plot by gfp status
xyplot(PC2~PC1,pcs,group=gfp,auto.key=T);
#plot by sample
xyplot(PC2~PC1,pcs,group=sample,auto.key=T);

Part IV - Differential Expression Calling

#load limma a package for calling differntially expressed genes
library(limma);
#load the Bioconductor annotation file for the Drosophila_2 chips
library(drosophila2.db);

#convert the expresison matrix from log2 scale to base 10
exprs <- data.frame(2^exprs); # we're just extracting the expression slot from the ExpressionSet object, convert to base10

#add group to the annotation data
d <- data.frame(pData(data));
d$group<-interaction(d$gfp,d$sample,sep='');

#rename columns to make easier to identify
colnames(exprs) <- d$group[match(row.names(d),colnames(exprs))];

#these variables can be exported to the environment so that they can be referred to directly
attach(exprs);

#create an enrichment matrix by dividing GFP+ signal by GFP- signal
enrich <- data.frame(s1=p1/n1,s2=p2/n2,s3=p3/n3,s4=p4/n4,s5=p5/n5,s6=p6/n6,s7=p7/n7,s8=p8/n8,row.names=row.names(exprs));

#set up the 'levels' matrix (this shows the groupings for contrasts)
enrich_design <- table(data.frame(row.names=colnames(enrich),unique(d[,1:2])));
#look at it
enrich_design;

#set up the contrast matrix (what we are comparing) - note 'wt-mut'
cont.matrix <- makeContrasts(wt-mut,levels=enrich_design);
#look at it
cont.matrix;

#fit the linear model
fit <- lmFit(enrich, enrich_design);

#calculate the fit coefficients and error for the comparison(s)
fit2 <- contrasts.fit(fit, cont.matrix);

#calculate t-statistics and log-odds of differential expression using emperical Bayes approach
fit2 <- eBayes(fit2);

#have a quick look at the result
topTable(fit2,n=10,sort.by='logFC',p=0.01,adjust='fdr');

#get the differentially expressed genes fdr=1%
diffexp <- topTable(fit2,adjust="fdr",sort.by='logFC',n=1000000000,p=0.01);

#reformat to make the results simpler to view
diffexp_top <- data.frame(diffexp[,c(1,2)],fdr=diffexp[,6],fc=diffexp[,2]);
#bring in the gene names
diffexp_final <- merge(diffexp_top,toTable(drosophila2SYMBOL[diffexp_top$ID]),by.x='ID',by.y='probe_id',sort=F,all.x=T);
#look at the top10
diffexp_final[1:10,];

Part V - Functional Enrichment Analysis [optional]

#Installing topGO
source("http://bioconductor.org/biocLite.R");
biocLite("topGO");

#Using  package to look for functional enrichment in gene lists
library(topGO);

geneList <- fit2$p.value[,1];
names(geneList) <- as.factor(fit2$genes[,1]);

topDiffGenes <- function(x){
  return(x<=0.01)
}

#now set up the Drosophila GO
library(drosophila2.db);

#create the GOData object that contains all of the gene association information for background and your own gene list
sampleGOdata <- new("topGOdata",description = "Simple session", ontology = "BP",allGenes = geneList, geneSel = topDiffGenes,nodeSize = 10,  annot = annFUN.db, affyLib = "drosophila2.db")

#perform a Fisher test to look for statistically significant enrichment of terms
resultFisher <- runTest(sampleGOdata, algorithm = "classic", statistic = "fisher")

#use a Kolmogorov–Smirnov test instead
resultKS <- runTest(sampleGOdata, algorithm = "classic", statistic = "ks")

#apply KS test, with more conservative outcome 'elim'
resultKS.elim <- runTest(sampleGOdata, algorithm = "elim", statistic = "ks")

#display all the tests together
allRes <- GenTable(sampleGOdata, classicFisher = resultFisher, classicKS = resultKS, elimKS = resultKS.elim, orderBy = "elimKS", ranksOf = "classicFisher", topNodes = 30)

Part VI - Clustering [optional]

library(cluster)

#check to see that gthe chips segregate by genotype
cl <- kmeans(t(enrich),2);

#cluster the top 100 differentially expressed genes
cl_group <- diffexp_final[1:100,]

#get the ids
ids <- cl_group$ID;

#extract only those expression values
cl_enrich <- enrich[ids,];

#use hierarchical clustering to partition the differentially expressed genes
hc <- hclust(dist(cl_enrich),method='complete');

#show the expression groupings in a heatmap
heatmap(as.matrix(cl_enrich),Rowv=hc$order);

Final session script

When we've completed the session you can download the complete session R script here

Sources of further information

  1. Microarrays
    • An excellent presentation by the functional genomics group at the European Bioinformatics Institute (EBI) pdf
    • NCBI-GEO
    • An example GEO entry
  2. R
  3. Bioconductor
  4. Statistics
  5. Tutorials
    • Simple explanation of the RMA method from PlexDB
    • Excellent tutorial for PCA from Princeton
    • As it's optional we may not have time to do the functional annotation work, if interested you can find the topGO manual here