Sunday, June 3, 2012

Drawing heatmaps in R with heatmap.2

A heatmap is a scale colour image for representing the observed values of two o more conditions, treatments, populations, etc. The observations can be raw values, norlamized values, fold changes or any others. Let's begin with an example. Say that I'm interesting in the differential expression of the proteome of three cell lines. I have my data in tab delimited text and will upload it into RStudio (a prettier interface for R: http://rstudio.org/).


setwd("C:/Users/Cell_Lines")
data = read.table("Intensities.txt", header=T, stringsAsFactors=F)    # Input the data. 
data = na.omit(data)     # Throw out rows with missing values.
View(data)



row.namesCell_A_rep1Cell_A_rep2Cell_A_rep3Cell_B_rep1Cell_B_rep2Cell_B_rep3
1sp|A6NHR9|SMHD1_HUMAN8.325863e+057.933804e+051.312156e+042.307032e+03235229.3551.351383e+05
2sp|A8CG34|P121C_HUMAN1.282906e+042.592576e+044.963257e+031.613087e+0362421.2571.519155e+04
3sp|O00116|ADAS_HUMAN7.554998e+035.372312e+036.686471e+041.309784e+0525299.4022.170408e+04
4sp|O00148|DX39A_HUMAN9.819398e+051.121632e+064.671197e+055.534510e+05899565.0324.044675e+05
5sp|O00257|CBX4_HUMAN4.139926e+041.995894e+049.100997e+031.623696e+04258025.5181.002840e+05



My data consist of three cell lines (only two are showed) with three replicates each, where each row represents the expression value of a protein (only five are showed). It is classical in R to use the heatmap function which outputs a yellow-orange heatmap with two dendrograms: one on the left side representing the rows (i.e., entities, genes, proteins, etc.), and a second one on the top representing the columns (i.e., conditions, treatments, etc.). Additionally the names of the rows and the columns are written in the right side and bottom, respectively:


heatmap(as.matrix(data))    # The data must be in matrix format and not in frame format!



Some things can be improve in the above heatmap. You can see that the column names didn't fit in the figure, and the rownames are too many that we better avoid them. The heatmap function has many paramters that allow the user to improve the figure. You can see them by typing help(heatmap). Right now, we are interested in two of them the cexCol to adjust the size of the column names and the labRow to choose the name of the rows. Try the following line and you will see how it changes. 

heatmap(as.matrix(data), cexCol=0.7, labRow=NA)



Although heatmap is a good function, a better one exists nowadays and is heatmap.2. Heatmap2 allows further formatting of our heatmap figures. For example, we can change the colours to the common red-green scale, represent the original values or replace them with the row-Z-score, add a colour key and many other options. Let's see! 

install.packages("gplots")
library(gplots)
help(heatmap.2)
heatmap.2(as.matrix(data), col=redgreen(75), scale="row", key=T, keysize=1.5,
          density.info="none", trace="none",cexCol=0.9, labRow=NA)




However this is not the way of dealing with the data. When we work with high throughput data, the first step is to log-transform the intensities and then apply a normalization method. We can do that with the following lines:

data = log2(data)
boxplot(data, cex.axis=0.8, las=2, main="Original distribution of data",
        ylab="Log2(Intensity)")  # Draw a boxplot.


# Normalization
library(preprocessCore)
data2 = normalize.quantiles(as.matrix(data))   # A quantile normalization. 


# Copy the row and column names from data to data2:
rownames(data2) = rownames(data)
colnames(data2) = colnames(data)


boxplot(data2, cex.axis=0.8, las=2, main="Distribution after normalization",
        ylab="Log2(Intensity)") 


# t-test using the limma package:

library(limma)
design = cbind(Cell_A = c(1,1,1,0,0,0,0,0,0), # First 3 columns->Cell_A
               Cell_B = c(0,0,0,1,1,1,0,0,0),
               Cell_C = c(0,0,0,0,0,0,1,1,1)) # Last 3 columns->Cell_C
fit = lmFit(data2, design=design) # Fit the original matrix to the above design.
# We want to compare A vs. B, A vs. C and B vs. C
contrastsMatrix = makeContrasts("Cell_A-Cell_B","Cell_A-Cell_C","Cell_B-Cell_C",
                                levels = design) 
fit2 = contrasts.fit(fit, contrasts = contrastsMatrix) # Making the comparisons.
fit2 = eBayes(fit2) # Moderating the t-tetst by eBayes method.

A heatmap can be seen as an array of figures. The first figure is the real heatmap itself, the second figure is the rows' dendrogram, the third is the columns' dendrogram, and the last figure is the color-key. In that sense, we can control the relative position of each figure using the layout parameter lmat and also introduce blank spaces to tight the figures by introducing 0 (zeros) in the lmat matrix. Additionally, we can customize the width and height of the  array by tuning the parameters lhei and lwid. Two more things! First, when working with microarray data, the common colours are red and green, but the current fashion in proteomics is to use red and blue. Secondly, lets suppose that the first 250 proteins belong to a same family, the second 250 to another and so on; and now we wanna know if they actually group together in the dendrogram. We assing a colour to each group, and then match each protein in the heatmap with its corresponding colour, using the parameter RowSideColors. When introducing a row side colour bar, this element becomes now the first element of the figure-matrix, so the real heatmap becomes the second element, the rows' dendrogram the third, the columns' dendrogram the fourth and the color-key the fith. You can see all this changes in the heatmap below.


data3 = as.matrix(fit)
Label = c(rep("purple",250),rep("orange",250),rep("darkgreen",250),
          rep("brown",323))
heatmap.2(data3, col=redblue(256), dendrogram="both",
          scale="row", key=T, keysize=0.5, density.info="none",
          trace="none",cexCol=1.2, labRow=NA, RowSideColors=Label,
          lmat=rbind(c(5,0,4,0),c(3,1,2,0)), lhei=c(2.0,5.0),
          lwid=c(1.5,0.2,2.5,2.5))



Finally, if we had a table with the fold change values per each comparison, and a second table containing the p-values of those fold changes, we can plot both in a single heatmap where the colours represent the fold change values and the numbers over each colour square its respective p-value. In my heatmap below the numbers over the squares represent the p-values in a discrete scale, where -1 means non-significant, 0 means significant and 1 highly significant. To write the p.values over the heatmap, we use the parameter cellnote, and if we want them in black, we need to set  notecol = "black". Additionally, since we want to plot the real fold change values and not their Z-score, we set scale = "none". Look at the following lines.

FoldCh = read.table("Fold_changes.txt",header=T,stringsAsFactors=F)
Pval = read.table("Pvalues.txt",header=T,stringsAsFactors=F)
heatmap.2(as.matrix(FoldCh),dendrogram="col", cellnote=as.matrix(PVal),
          notecol="black",col=redblue(256),scale="none",key=TRUE, keysize=1.5,
          density.info="none", trace="none", cexRow=0.7,cexCol=1.2)





And that's the end of this post. I hope you manage to do something similar with your own data. Till next time!



4 comments:

  1. chevere tío, soy pável, está bueno el blog... y cómo es que el "heatmap function" construye los dendrogramas? basados en similaridad de expresión de cada línea celular por ejemplo? o simplemente agrupando cada tipo celular en clusters
    un abrazo, éxitos man!

    ReplyDelete
  2. Hello. I would to know how did you convert the p.values into a discrete scale of -1,0,1?

    Thank you.

    Telma Fernandes

    ReplyDelete
  3. Thanks for explaining so nicely. Is there any way to color dendrogram edges in heatmap.2?

    Thanks

    Tauqeer

    ReplyDelete