# Bioinformatics 1 - Microarrays

In this session I am going to introduce gene expression microarrays, which are used to characterise the RNA composition of a sample across thousands of genes simultaneously. This allows biologists to use discovery or data-driven approaches to their work. Typical examples of microarray use include comparing gene expression between wild-type and mutant samples, temporal profiling, assaying drug effects and identifying bio-markers for disease.

Microarray analysis falls into a category of high-throughput technologies that generate large multi-variate data sets for which principled methods need to be employed to accurately quantify changes. This begins with quality control and normalisation processes to allow meaningful cross-comparison, properly modelled statistical testing to identify changes and ends with an assimilation of the primary results into further downstream analyses to interpret the changes in a biologically meaningful way.

**Overview**

- Lecture slides can be found here
- R/Bioconductor - statistical computing for the biological sciences
- Importing microarray data
- Quality control, pre-processing and normalisation
- Principal component analysis (PCA)
- Differential expression calling
- Functional enrichment analysis
- Clustering

**R/Bioconductor**

Log in to your DICE account

Please make a directory called bio1_microarray in your home directory and then change into it.

mkdir bio1_microarray cd bio1_microarray

now download the following files into the directory using your web browser

- bio1_microarray_chip_data.tar.gz (53M)
- bio1_AQM_pre.tar.gz (3.8M)
- bio1_AQM_post.tar.gz (0.5M)

next uncompress the archived data using the following command

tar -zxf *

next invoke R at the command line

R

Then execute the following lines to install the necessary libraries for the session :-

install.packages('lattice'); source("http://bioconductor.org/biocLite.R"); biocLite("affy"); biocLite("limma"); biocLite("drosophila2.db");

This will take a while as the components are compiled into the R package library.

For future reference a guide to installing R/Bioconductor can be found here http://www.bioconductor.org/install/

**Part I - loading the microarray data**

#load libraries library(affy); #set working directory setwd('./'); #load Affymetrix CEL files data <- ReadAffy(); #loading CEL files for wild-type and mutant samples GFP +ve and -ve (16 chips) #find out a bit about the object class(data); #what's an AffyBatch ?AffyBatch #some access methods probes(data,which='pm'); #probe perfect match expression boxplot(data); #box-whisker plots each chip image(data[,1]); #see a chip image plot #examine the slots data@cdfName; #get the chip platform name data@phenoData; #get the experimental data #what's an AnnotatedDataFrame ? ?AnnotatedDataFrame; #lets annotate on the experimental information pData(data)$condition <- as.factor(c(rep('wt',8),rep('mut',8))); pData(data)$gfp <- as.factor(rep(c('p','n'),8)); pData(data)$sample <- as.factor(sort(rep(seq(1,8),2))); #view the updated phenotype data pData(data);

**Part II - Checking quality, pre-processing and normalisation**

#library(arrayQualityMetrics); #arrayQualityMetrics(data,outdir='./AQM_pre/'); #DO NOT RUN THIS COMMAND

arrayQualityMetrics runs a whole series of tests on the chips, but takes a very long time to compute. I have pre-computed these pre- and post- normalisation and we are going to look through these. In order to do this use your web browser to open the directories AQM_pre and AQM_post from your DICE home (these were two of the files you uncompressed earlier). If you have a problem you can also access them here :- AQM_pre and AQM_post

#now lets normalise the data norm <- rma(data); # RMA - the Robust Multi-Chip Average - returns expression in log2 #and extract the expression matrix exprs <- exprs(norm);

**Part III - Principal Component Analysis**

pca <- prcomp(t(exprs)); # Note, we are doing PCA by chip i.e. trying to see where the variation is between the chips #look at where the variation is across the Principal Components summary(pca); #extract the principal components pcs <- data.frame(pca$x); #add in the annotation data by merging (just for the first 2 principal components) pcs <- merge(pcs[,1:2],pData(data),by=0); #plot the first two principal components library(lattice); #plot by condition xyplot(PC2~PC1,pcs,group=condition,auto.key=T); #plot by gfp status xyplot(PC2~PC1,pcs,group=gfp,auto.key=T); #plot by sample xyplot(PC2~PC1,pcs,group=sample,auto.key=T);

**Part IV - Differential Expression Calling**

#load limma a package for calling differntially expressed genes library(limma); #load the Bioconductor annotation file for the Drosophila_2 chips library(drosophila2.db); #convert the expresison matrix from log2 scale to base 10 exprs <- data.frame(2^exprs); # we're just extracting the expression slot from the ExpressionSet object, convert to base10 #add group to the annotation data d <- data.frame(pData(data)); d$group<-interaction(d$gfp,d$sample,sep=''); #rename columns to make easier to identify colnames(exprs) <- d$group[match(row.names(d),colnames(exprs))]; #these variables can be exported to the environment so that they can be referred to directly attach(exprs); #create an enrichment matrix by dividing GFP+ signal by GFP- signal enrich <- data.frame(s1=p1/n1,s2=p2/n2,s3=p3/n3,s4=p4/n4,s5=p5/n5,s6=p6/n6,s7=p7/n7,s8=p8/n8,row.names=row.names(exprs)); #set up the 'levels' matrix (this shows the groupings for contrasts) enrich_design <- table(data.frame(row.names=colnames(enrich),unique(d[,1:2]))); #look at it enrich_design; #set up the contrast matrix (what we are comparing) - note 'wt-mut' cont.matrix <- makeContrasts(wt-mut,levels=enrich_design); #look at it cont.matrix; #fit the linear model fit <- lmFit(enrich, enrich_design); #calculate the fit coefficients and error for the comparison(s) fit2 <- contrasts.fit(fit, cont.matrix); #calculate t-statistics and log-odds of differential expression using emperical Bayes approach fit2 <- eBayes(fit2); #have a quick look at the result topTable(fit2,n=10,sort.by='logFC',p=0.01,adjust='fdr'); #get the differentially expressed genes fdr=1% diffexp <- topTable(fit2,adjust="fdr",sort.by='logFC',n=1000000000,p=0.01); #reformat to make the results simpler to view diffexp_top <- data.frame(diffexp[,c(1,2)],fdr=diffexp[,6],fc=diffexp[,2]); #bring in the gene names diffexp_final <- merge(diffexp_top,toTable(drosophila2SYMBOL[diffexp_top$ID]),by.x='ID',by.y='probe_id',sort=F,all.x=T); #look at the top10 diffexp_final[1:10,];

**Part V - Functional Enrichment Analysis** [optional]

#Installing topGO source("http://bioconductor.org/biocLite.R"); biocLite("topGO"); #Using package to look for functional enrichment in gene lists library(topGO); geneList <- fit2$p.value[,1]; names(geneList) <- as.factor(fit2$genes[,1]); topDiffGenes <- function(x){ return(x<=0.01) } #now set up the Drosophila GO library(drosophila2.db); #create the GOData object that contains all of the gene association information for background and your own gene list sampleGOdata <- new("topGOdata",description = "Simple session", ontology = "BP",allGenes = geneList, geneSel = topDiffGenes,nodeSize = 10, annot = annFUN.db, affyLib = "drosophila2.db") #perform a Fisher test to look for statistically significant enrichment of terms resultFisher <- runTest(sampleGOdata, algorithm = "classic", statistic = "fisher") #use a Kolmogorovâ€“Smirnov test instead resultKS <- runTest(sampleGOdata, algorithm = "classic", statistic = "ks") #apply KS test, with more conservative outcome 'elim' resultKS.elim <- runTest(sampleGOdata, algorithm = "elim", statistic = "ks") #display all the tests together allRes <- GenTable(sampleGOdata, classicFisher = resultFisher, classicKS = resultKS, elimKS = resultKS.elim, orderBy = "elimKS", ranksOf = "classicFisher", topNodes = 30)

**Part VI - Clustering** [optional]

library(cluster) #check to see that gthe chips segregate by genotype cl <- kmeans(t(enrich),2); #cluster the top 100 differentially expressed genes cl_group <- diffexp_final[1:100,] #get the ids ids <- cl_group$ID; #extract only those expression values cl_enrich <- enrich[ids,]; #use hierarchical clustering to partition the differentially expressed genes hc <- hclust(dist(cl_enrich),method='complete'); #show the expression groupings in a heatmap heatmap(as.matrix(cl_enrich),Rowv=hc$order);

**Final session script**

When we've completed the session you can download the complete session R script here

**Sources of further information**

- Microarrays
- R
- http://www.r-project.org/ R home
- http://www.stats.bris.ac.uk/R/ CRAN
- http://www.stats.bris.ac.uk/R/doc/manuals/R-lang.html an introduction to R
- http://www.statmethods.net/ very nice site with lots of examples
- http://gallery.r-enthusiasts.com/ R graph examples
- http://www.his.sunderland.ac.uk/~cs0her/Statistics/UsingLatticeGraphicsInR.htm using the lattice package
- Bioconductor
- http://www.bioconductor.org/ Bioconductor home
- http://www.bioconductor.org/help/workflows/oligo-arrays/ guide to oligo-array analysis
- Statistics
- Description of the Fisher exact test

- Tutorials