tag:blogger.com,1999:blog-25480123059276586492024-02-19T08:27:10.785+02:00Mannheimia goes programmingMannheimiahttp://www.blogger.com/profile/16733590560025537189noreply@blogger.comBlogger4125tag:blogger.com,1999:blog-2548012305927658649.post-54410815388106277832012-06-03T22:16:00.001+03:002012-07-11T14:24:18.811+03:00Drawing heatmaps in R with heatmap.2<div style="text-align: justify;">
<span style="font-size: large;">A heatmap is a scale colour image for representing the observed values of two o more conditions, treatments, populations, etc. The observations can be raw values, norlamized values, fold changes or any others. Let's begin with an example. Say that I'm interesting in the differential expression of the proteome of three cell lines. I have my data in tab delimited text and will upload it into RStudio (a prettier interface for R: <a href="http://rstudio.org/">http://rstudio.org/</a>).</span></div>
<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">setwd("C:/Users/Cell_Lines")</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">data = read.table("Intensities.txt", header=T, stringsAsFactors=F) # Input the data. </span><br />
<span style="font-family: 'Courier New', Courier, monospace;">data = na.omit(data) # Throw out rows with missing values.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">View(data)</span><br />
<br />
<br />
<span class="Apple-style-span" style="border-collapse: separate; border-spacing: 0px; font-family: Arial;"></span><br />
<table style="border-bottom-style: none; border-collapse: collapse; border-color: initial; border-left-style: none; border-right-style: none; border-top-style: none; border-width: initial; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><thead>
<tr><td id="origin" style="background-color: #f0f0f0; border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: initial; border-left-style: none; border-left-width: initial; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; font-weight: bold; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; text-align: right; white-space: pre;"></td><th style="background-color: #f0f0f0; border: 1px solid rgb(221, 221, 221); color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; padding: 3px 12px 3px 6px; text-align: left; white-space: pre;">row.names</th><th style="background-color: #f0f0f0; border: 1px solid rgb(221, 221, 221); color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; padding: 3px 12px 3px 6px; text-align: left; white-space: pre;">Cell_A_rep1</th><th style="background-color: #f0f0f0; border: 1px solid rgb(221, 221, 221); color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; padding: 3px 12px 3px 6px; text-align: left; white-space: pre;">Cell_A_rep2</th><th style="background-color: #f0f0f0; border: 1px solid rgb(221, 221, 221); color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; padding: 3px 12px 3px 6px; text-align: left; white-space: pre;">Cell_A_rep3</th><th style="background-color: #f0f0f0; border: 1px solid rgb(221, 221, 221); color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; padding: 3px 12px 3px 6px; text-align: left; white-space: pre;">Cell_B_rep1</th><th style="background-color: #f0f0f0; border: 1px solid rgb(221, 221, 221); color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; padding: 3px 12px 3px 6px; text-align: left; white-space: pre;">Cell_B_rep2</th><th style="background-color: #f0f0f0; border: 1px solid rgb(221, 221, 221); color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; padding: 3px 12px 3px 6px; text-align: left; white-space: pre;">Cell_B_rep3</th></tr>
</thead><tbody>
<tr><td class="rn" style="background-color: #f0f0f0; border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: initial; border-left-style: none; border-left-width: initial; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; font-weight: bold; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; text-align: right; white-space: pre;">1</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">sp|A6NHR9|SMHD1_HUMAN</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">8.325863e+05</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">7.933804e+05</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">1.312156e+04</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">2.307032e+03</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">235229.355</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">1.351383e+05</td></tr>
<tr><td class="rn" style="background-color: #f0f0f0; border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: initial; border-left-style: none; border-left-width: initial; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; font-weight: bold; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; text-align: right; white-space: pre;">2</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">sp|A8CG34|P121C_HUMAN</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">1.282906e+04</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">2.592576e+04</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">4.963257e+03</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">1.613087e+03</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">62421.257</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">1.519155e+04</td></tr>
<tr><td class="rn" style="background-color: #f0f0f0; border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: initial; border-left-style: none; border-left-width: initial; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; font-weight: bold; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; text-align: right; white-space: pre;">3</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">sp|O00116|ADAS_HUMAN</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">7.554998e+03</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">5.372312e+03</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">6.686471e+04</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">1.309784e+05</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">25299.402</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">2.170408e+04</td></tr>
<tr><td class="rn" style="background-color: #f0f0f0; border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: initial; border-left-style: none; border-left-width: initial; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; font-weight: bold; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; text-align: right; white-space: pre;">4</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">sp|O00148|DX39A_HUMAN</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">9.819398e+05</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">1.121632e+06</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">4.671197e+05</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">5.534510e+05</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">899565.032</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">4.044675e+05</td></tr>
<tr><td class="rn" style="background-color: #f0f0f0; border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: initial; border-left-style: none; border-left-width: initial; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; color: #555555; font-family: 'Segoe UI', 'Lucida Grande', Verdana, Helvetica; font-size: 11px; font-weight: bold; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; text-align: right; white-space: pre;">5</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">sp|O00257|CBX4_HUMAN</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">4.139926e+04</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">1.995894e+04</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">9.100997e+03</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">1.623696e+04</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">258025.518</td><td style="border-bottom-color: rgb(221, 221, 221); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(221, 221, 221); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(221, 221, 221); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-top-width: 1px; font-family: Consolas, 'Lucida Console', Monaco, monospace; font-size: 11px; padding-bottom: 3px; padding-left: 6px; padding-right: 12px; padding-top: 3px; white-space: pre;">1.002840e+05</td></tr>
</tbody></table>
<br />
<br />
<br />
<div style="text-align: justify;">
<span style="font-size: large;">My data consist of three cell lines (only two are showed) with three replicates each, where each row represents the expression value of a protein (only five are showed). It is classical in R to use the </span><span style="font-family: 'Courier New', Courier, monospace;">heatmap</span><span style="font-size: large;"> function which outputs a yellow-orange heatmap with two dendrograms: one on the left side representing the rows (i.e., entities, genes, proteins, etc.), and a second one on the top representing the columns (i.e., conditions, treatments, etc.). Additionally the names of the rows and the columns are written in the right side and bottom, respectively:</span></div>
<span style="font-size: large;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;">heatmap(as.matrix(data)) # The data must be in matrix format and not in frame format!</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifNnUIs29JTMDOnunBgVYosYJVOQidGVVv-6ee4lxrislFr_VBjmxZ0aTDpbcZ58d9I-2-Ud_DvbO9_fATVokb91W-hpYX2CE4JlOPmejv5zHHT3STvnK188ZaN9ql8lTrsBWS17rxztpl/s1600/Heatmap1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="155" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifNnUIs29JTMDOnunBgVYosYJVOQidGVVv-6ee4lxrislFr_VBjmxZ0aTDpbcZ58d9I-2-Ud_DvbO9_fATVokb91W-hpYX2CE4JlOPmejv5zHHT3STvnK188ZaN9ql8lTrsBWS17rxztpl/s200/Heatmap1.png" width="200" /></a></div>
<br />
<br />
<div style="text-align: justify;">
<span style="font-size: large;">Some things can be improve in the above heatmap. You can see that the column names didn't fit in the figure, and the rownames are too many that we better avoid them. The heatmap function has many paramters that allow the user to improve the figure. You can see them by typing </span><span style="font-family: 'Courier New', Courier, monospace;">help(heatmap)</span><span style="font-size: large;">. Right now, we are interested in two of them the </span><span style="font-family: 'Courier New', Courier, monospace;">cexCol</span><span style="font-size: large;"> to adjust the size of the column names and the </span><span style="font-family: 'Courier New', Courier, monospace;">labRow</span><span style="font-size: large;"> to choose the name of the rows. Try the following line and you will see how it changes. </span></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<span style="font-family: 'Courier New', Courier, monospace;">heatmap(as.matrix(data), cexCol=0.7, labRow=NA)</span></div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwd9ylAh8rnpkkpX4HmvlkFF99-jMWMrmNagiy3hoewlUOoo65UaYV0COac7KTOzSyAie3gSr0CcWhzJeVa13nvXqMMWf0n9ZLg-1MADBYD64LmMOPQZcBaLFzWjqBEfF5CTu8h_lpMIPk/s1600/heatmap2.png" imageanchor="1"><img border="0" height="248" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwd9ylAh8rnpkkpX4HmvlkFF99-jMWMrmNagiy3hoewlUOoo65UaYV0COac7KTOzSyAie3gSr0CcWhzJeVa13nvXqMMWf0n9ZLg-1MADBYD64LmMOPQZcBaLFzWjqBEfF5CTu8h_lpMIPk/s320/heatmap2.png" width="320" /></a></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: justify;">
<span style="font-size: large;">Although heatmap is a good function, a better one exists nowadays and is </span><span style="font-family: 'Courier New', Courier, monospace;">heatmap.2</span><span style="font-size: large;">. Heatmap2 allows further formatting of our heatmap figures. For example, we can change the colours to the common red-green scale, represent the original values or replace them with the row-Z-score, add a colour key and many other options. Let's see! </span></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<span style="font-family: 'Courier New', Courier, monospace;">install.packages("gplots")</span></div>
<div style="text-align: left;">
<span style="font-family: 'Courier New', Courier, monospace;">library(gplots)</span></div>
<div style="text-align: left;">
<span style="font-family: 'Courier New', Courier, monospace;">help(heatmap.2)</span></div>
<div style="text-align: left;">
</div>
<span style="font-family: 'Courier New', Courier, monospace;">heatmap.2(as.matrix(data), col=redgreen(75), scale="row", key=T, keysize=1.5,</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> density.info="none", trace="none",cexCol=0.9, labRow=NA)</span><br />
<br />
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyJACae-uteyp6EIr4RsymE-LT8MYRJzmDRBhXu3h97xhKZwoIjoqyMuCJOce96WkBmK-I1GXDUM96-TzFs-Qso7aBTtG7tcwI7Y9isvEsiFdN5JsxiuoHimJfiwCGVSyL3dH8jihyphenhyphen9FvN/s1600/Heatmap3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="249" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyJACae-uteyp6EIr4RsymE-LT8MYRJzmDRBhXu3h97xhKZwoIjoqyMuCJOce96WkBmK-I1GXDUM96-TzFs-Qso7aBTtG7tcwI7Y9isvEsiFdN5JsxiuoHimJfiwCGVSyL3dH8jihyphenhyphen9FvN/s320/Heatmap3.png" width="320" /></a></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: justify;">
<span style="font-family: inherit; font-size: large;">However this is not the way of dealing with the data. When we work with high throughput data, the first step is to log-transform the intensities and then apply a normalization method. We can do that with the following lines:</span></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<span style="font-family: 'Courier New', Courier, monospace;">data = log2(data)</span></div>
<div style="text-align: left;">
</div>
<span style="font-family: 'Courier New', Courier, monospace;">boxplot(data, cex.axis=0.8, las=2, main="Original distribution of data",</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> ylab="Log2(Intensity)") # Draw a boxplot.</span><br />
<br />
<div style="text-align: left;">
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div style="text-align: left;">
</div>
<span style="font-family: 'Courier New', Courier, monospace;"># Normalization</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">library(preprocessCore)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">data2 = normalize.quantiles(as.matrix(data)) # A quantile normalization. </span><br />
<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;"># Copy the row and column names from data to data2:</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">rownames(data2) = rownames(data)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">colnames(data2) = colnames(data)</span><br />
<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">boxplot(data2, cex.axis=0.8, las=2, main="Distribution after normalization",</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> ylab="Log2(Intensity)") </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"># t-test using the limma package:</span><br />
<span class="Apple-style-span" style="border-collapse: separate; border-spacing: 0px;"></span><br />
<pre><span style="font-family: 'Courier New', Courier, monospace;">library(limma)
design = cbind(Cell_A = c(1,1,1,0,0,0,0,0,0), # First 3 columns->Cell_A
Cell_B = c(0,0,0,1,1,1,0,0,0),
Cell_C = c(0,0,0,0,0,0,1,1,1)) # Last 3 columns->Cell_C
fit = lmFit(data2, design=design) # Fit the original matrix to the above design.
# We want to compare A vs. B, A vs. C and B vs. C
contrastsMatrix = makeContrasts("Cell_A-Cell_B","Cell_A-Cell_C","Cell_B-Cell_C",
levels = design)
fit2 = contrasts.fit(fit, contrasts = contrastsMatrix) # Making the comparisons.
fit2 = eBayes(fit2) # Moderating the t-tetst by eBayes method.</span></pre>
<pre><span style="font-family: inherit; font-size: large; text-align: justify;">
</span></pre>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">A heatmap can be seen as an array of figures. The first figure is the real heatmap itself, the second figure is the rows' dendrogram, the third is the columns' dendrogram, and the last figure is the color-key. In that sense, we can control the relative position of each figure using the layout parameter </span><span style="font-family: 'Courier New', Courier, monospace;">lmat</span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"> and also introduce blank spaces to tight the figures by introducing </span><span style="font-family: 'Courier New', Courier, monospace; font-size: large;">0 </span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">(zeros)</span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"> in the lmat matrix</span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">. Additionally, we can customize the width and height of the array by tuning the parameters </span><span style="font-family: 'Courier New', Courier, monospace;">lhei</span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"> and </span><span style="font-family: 'Courier New', Courier, monospace;">lwid</span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">. Two more things! First, when working with microarray data, the common colours are red and green, but the current fashion </span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">in proteomics </span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">is to use red and blue. Secondly, lets suppose that the first 250 proteins belong to a same family, the second 250 to another and so on; and now we wanna know if they actually group together in the dendrogram. We assing a colour to each group, and then match each protein in the heatmap with its corresponding colour, using the parameter </span><span style="font-family: 'Courier New', Courier, monospace;">RowSideColors</span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">. When introducing a row side colour bar, this element becomes now the first element of the figure-matrix, so the real heatmap becomes the second element, the rows' dendrogram the third, the columns' dendrogram the fourth and the color-key the fith. You can see all this changes in the heatmap below.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;">data3 = as.matrix(fit)</span></div>
<pre><span style="font-family: 'Courier New', Courier, monospace;">Label = c(rep("purple",250),rep("orange",250),rep("darkgreen",250),
rep("brown",323))</span></pre>
<pre><span style="font-family: 'Courier New', Courier, monospace;">heatmap.2(data3, col=redblue(256), dendrogram="both",
scale="row", key=T, keysize=0.5, density.info="none",
trace="none",cexCol=1.2, labRow=NA, RowSideColors=Label,
lmat=rbind(c(5,0,4,0),c(3,1,2,0)), lhei=c(2.0,5.0),
lwid=c(1.5,0.2,2.5,2.5))</span></pre>
<pre><span style="font-family: 'Courier New', Courier, monospace;">
</span></pre>
<pre><span style="font-family: 'Courier New', Courier, monospace;">
</span></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRGwN-2uqo85qGtxSv07pZt85Uy4qLnnqn4TlIDnbLTAx_Q1nN00h0wQBZeBYeH_k0Sj-tvsY8uK9aRHS_YL1CQr4nrF7PXj_2k9HFumFjqzKYNNuWgLqusb5gjPKASf8IvpWU7jqkyIP0/s1600/Heatmap5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="249" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRGwN-2uqo85qGtxSv07pZt85Uy4qLnnqn4TlIDnbLTAx_Q1nN00h0wQBZeBYeH_k0Sj-tvsY8uK9aRHS_YL1CQr4nrF7PXj_2k9HFumFjqzKYNNuWgLqusb5gjPKASf8IvpWU7jqkyIP0/s320/Heatmap5.png" width="320" /></a></div>
<pre><span style="font-family: 'Courier New', Courier, monospace;">
</span></pre>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Finally, if we had a table with the fold change values per each comparison, and a second table containing the p-values of those fold changes, we can plot both in a single heatmap where the colours represent the fold change values and the numbers over each colour square its respective p-value. In my heatmap below the numbers over the squares represent the p-values in a discrete scale, where -1 means non-significant, 0 means significant and 1 highly significant. To write the p.values over the heatmap, we use the parameter </span><span style="font-family: 'Courier New', Courier, monospace;">cellnote</span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">, and if we want them in black, we need to set </span><span style="font-family: 'Courier New', Courier, monospace;">notecol = "black"</span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">. Additionally, since we want to plot the real fold change values and not their Z-score, we set </span><span style="font-family: 'Courier New', Courier, monospace;">scale = "none"</span><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">. Look at the following lines.</span></div>
<pre><span style="font-family: Times, 'Times New Roman', serif;">
</span></pre>
<pre><span style="font-family: 'Courier New', Courier, monospace;">FoldCh = read.table("Fold_changes.txt",header=T,stringsAsFactors=F)
Pval = read.table("Pvalues.txt",header=T,stringsAsFactors=F)</span></pre>
<pre><span style="font-family: 'Courier New', Courier, monospace;">heatmap.2(as.matrix(FoldCh),dendrogram="col", cellnote=as.matrix(PVal),
notecol="black",col=redblue(256),scale="none",key=TRUE, keysize=1.5,
density.info="none", trace="none", cexRow=0.7,cexCol=1.2)</span></pre>
<pre><span style="font-family: 'Courier New', Courier, monospace;">
</span></pre>
<pre><span style="font-family: Times, 'Times New Roman', serif;">
</span></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKM4LHwzRmAQQjqr3fxT3nv9bgjJqYnhJOCCpfxj5cu_huzvm_T5R02rta-iJq8yZmj6yU2JUmVTTIyRcPXmnJjNeUSnnAz8LGc03_A_h4Wnyx-Nt578PAb25ejR3ihiLWFaMPGIUWBDOv/s1600/Heatmap6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="311" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKM4LHwzRmAQQjqr3fxT3nv9bgjJqYnhJOCCpfxj5cu_huzvm_T5R02rta-iJq8yZmj6yU2JUmVTTIyRcPXmnJjNeUSnnAz8LGc03_A_h4Wnyx-Nt578PAb25ejR3ihiLWFaMPGIUWBDOv/s400/Heatmap6.png" width="400" /></a></div>
<pre><span style="font-family: Times, 'Times New Roman', serif;">
</span></pre>
<pre><span style="font-family: Times, 'Times New Roman', serif;">
</span></pre>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"></span><br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">And that's the end of this post. I hope you manage to do something similar with your own data. Till next time!</span></div>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">
</span><br />
<br />
<div style="text-align: left;">
<br /></div>Mannheimiahttp://www.blogger.com/profile/16733590560025537189noreply@blogger.com9tag:blogger.com,1999:blog-2548012305927658649.post-79072086478132191772012-05-20T22:27:00.000+03:002012-05-20T22:43:33.722+03:00Coroutines in Python: an example applied to Bioinformatics.<div style="text-align: justify;">
<span style="font-size: large;">Hello again! The first post of this blog dealed with a <a href="http://mannheimiagoesprogramming.blogspot.de/2012/04/uniprot-keylist-file-parser-in-python.html">UniProt keylist parser</a>, where I told you that we could mine information like the one store in the annotation field of each record with some additional scripts. Well this current post deals with it. The code presented at the bottom can actually be simpler, but my aim is to introduce a non very common Python object called coroutine.</span></div>
<div style="text-align: justify;">
<span style="font-size: large;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-size: large;">Coroutines are similar to functions and generators. However, whereas these operate on a single set of input arguments, coroutines can take a sequence of inputs sent to them. Indeed, coroutines are functions created by using the <span style="color: orange;">yield</span> statement (as generators) but written as an expression '(<span style="color: orange;">yield</span>)'. Let's see an example of <b>David Beazley </b>showed in the fourth edition of his book <b>Python Essential Reference</b>:</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="color: orange;">def</span> <span style="color: blue;">print_matches</span>(matchtext):</div>
<div style="text-align: justify;">
<span style="color: orange;">print</span> <span style="color: #38761d;">"Looking for"</span>, matchtext</div>
<div style="text-align: justify;">
<span style="color: orange;">while</span> <span style="color: purple;">True</span>:</div>
<div style="text-align: justify;">
line = (<span style="color: orange;">yield</span>) <span style="color: red;"> # Get a line of a text</span></div>
<div style="text-align: justify;">
<span style="color: orange;">if</span> matchtext <span style="color: orange;">in</span> line:</div>
<div style="text-align: justify;">
<span style="color: orange;">print</span> line</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-size: large;">To use a coroutine, you first call it with the <b>next</b> method, and then start sending arguments into it using the <b>send</b> method.</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
>>> matcher = print_matches("python")</div>
<div style="text-align: justify;">
>>> matcher.next()</div>
<div style="text-align: justify;">
<span style="color: blue;">Looking for python</span></div>
<div style="text-align: justify;">
>>> matcher.send(<span style="color: #38761d;">"Hello World"</span>) <span style="color: red;"># Generates no output because the text 'python' is not in the given string</span></div>
<div style="text-align: justify;">
>>> matcher.send(<span style="color: #38761d;">"python is cool"</span>)</div>
<div style="text-align: justify;">
<span style="color: blue;">python is cool</span></div>
<div style="text-align: justify;">
>>> matcher.send(<span style="color: #38761d;">"Yeah!"</span>)</div>
<div style="text-align: justify;">
>>> matcher.close() <span style="color: red;"># Done with the matcher function call</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-size: large;">Coroutines are useful when writing programs that run in parallel, where one part of a program cosumes the information generated by the other. Like generators, coroutines can be very useful for processing information produced by pipelines, streams or data flow. However in this post I will use it in an introductory way for you to observe how it works. Let's suppose we have a UniProt keylist file from which we want to know how many proteins are annotated as residing in the nucleus, cytoplasm or any other subcellular location. Our <b>parser</b> generator defined in a <a href="http://mannheimiagoesprogramming.blogspot.de/2012/04/uniprot-keylist-file-parser-in-python.html">previous post</a> can extract the annotations stored in the <span style="color: #38761d;">"CC"</span> field of any record. The <span style="color: #6aa84f;">"CC"</span> field contains information like the function of a protein, tissue specificity and subcellular location.</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
>>> re[<span style="color: #6aa84f;">"CC"</span>]</div>
<div style="text-align: justify;">
'-!- FUNCTION: May act as a transcriptional repressor. Down-regulates CA9 expression. -!- SUBUNIT: Interacts with HDAC4. <span style="background-color: yellow;">-!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm, cytosol. Note=Mainly located in the nucleus.</span> -!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q9Y6X9-1; Sequence=Displayed; Note=No experimental confirmation available; Name=2; IsoId=Q9Y6X9-2; Sequence=VSP_041759; Note=No experimental confirmation available; -!- TISSUE SPECIFICITY: Highly expressed in smooth muscle, pancreas and testis. -!- SIMILARITY: Contains 1 CW-type zinc finger. -!- SEQUENCE CAUTION: Sequence=AAC12954.1; Type=Erroneous gene model prediction; Sequence=BAA74875.2; Type=Miscellaneous discrepancy; Note=Intron retention; ----------------------------------------------------------------------- Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms Distributed under the Creative Commons Attribution-NoDerivs License -----------------------------------------------------------------------'</div>
<div style="text-align: justify;">
>>> </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-size: large;">In the topic of this post, we are interested in the subcellular location of our proteins. The python code to work on that piece of the "CC" field is the following:</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
</div>
<span style="color: #6aa84f;">''' This program looks for a specific text in the subcellular location field</span><br />
<span style="color: #6aa84f;"> of the proteins stored in a UniProtKB keylist file. If requested, those</span><br />
<span style="color: #6aa84f;"> entries with no subcellular location annotation at all are written in a</span><br />
<span style="color: #6aa84f;"> file.'''</span><br />
<span style="font-size: x-small;"><br /></span><br />
subloc = 0<br />
<span style="color: orange;">def</span> <span style="color: blue;">subloc_counter</span>(matchtext): <span style="color: red;"># This a coroutine (similar to, but neither a</span><br />
<span style="color: orange;">global</span> subloc <span style="color: red;"># function nor a generator).</span><br />
<span style="color: orange;">while</span> <span style="color: purple;">True</span>:<br />
string = (<span style="color: orange;">yield</span>) <span style="color: red;"># Incorporates the input to be evaluated.</span><br />
<span style="color: orange;">if</span> matchtext <span style="color: orange;">in</span> string:<br />
subloc += 1<br />
<br />
<span style="color: red;"># Now let's read the KeyList file using the generator defined in UniProt_parser.py</span><br />
<span style="color: orange;">from</span> UniProt_parser <span style="color: orange;">import</span> *<br />
handle = <span style="color: purple;">open</span>(<span style="color: purple;">raw_input</span>(<span style="color: #6aa84f;">"Path and name of the keylist file: "</span>))<br />
records = parse(handle)<br />
<br />
<span style="color: red;"># Proteins with non subcellular location annotation might be interesting.</span><br />
store = <span style="color: purple;">raw_input</span>(<span style="color: #6aa84f;">"Do you wanna store the IDs of those proteins without \</span><br />
<span style="color: #6aa84f;">subcellular location annotation? Answer yes/no: "</span>)<br />
<span style="color: orange;">if</span> store == <span style="color: #6aa84f;">"yes"</span>:<br />
output = <span style="color: purple;">open</span>(<span style="color: purple;">raw_input</span>(<span style="color: #6aa84f;">"Where to store the entries without subcell location? "</span>), <span style="color: #6aa84f;">'w'</span>)<br />
<br />
<span style="color: red;"># Now, let's count the matches to an input text:</span><br />
text = <span style="color: purple;">raw_input</span>(<span style="color: #6aa84f;">"Enter the text you want to look for in the subcellular location field: "</span>)<br />
counter = subloc_counter(text)<br />
counter.next() <span style="color: red;"># Preparing for counting</span><br />
<br />
<span style="color: orange;">for</span> re <span style="color: orange;">in</span> records:<br />
SL = re[<span style="color: #6aa84f;">"CC"</span>].split(<span style="color: #6aa84f;">"-!- "</span>) <span style="color: red;"># The type of output return by the split function is a list.</span><br />
<span style="color: orange;">for</span> item in <span style="color: orange;">SL</span>:<br />
<span style="color: orange;">if </span>item.startswith(<span style="color: #6aa84f;">"SUBCELLULAR LOCATION"</span>): <span style="color: red;"># If subloc exists, then keep only the</span><br />
SL = item.lower() <span style="color: red;"># lines related to it. And put it in lowercase so to avoid</span><br />
<span style="color: orange;"> break </span> <span style="color: red;"> # case-sensitivity.</span><br />
<span style="color: orange;"> else</span>:<br />
<span style="color: orange;">pass</span><br />
<br />
<span style="color: orange;"> if</span> store == <span style="color: #6aa84f;">"yes"</span> <span style="color: orange;">and</span> type(SL) == <span style="color: purple;">list</span>: <span style="color: red;"># If SL is still a list, it means that non</span><br />
output.write(re[<span style="color: #6aa84f;">"ID"</span>].split(<span style="color: #6aa84f;">" "</span>)[0]+<span style="color: #6aa84f;">'\n'</span>) <span style="color: red;"># subloc annotation was found. Then store</span><br />
<span style="color: red;"># that protein in the output file.</span><br />
<span style="color: orange;">else</span>:<br />
counter.send(SL) <span style="color: red;"># Look for the text in the current SubLoc field.</span><br />
<br />
<span style="color: red;"># After searching in the annotation of all proteins:</span><br />
<span style="color: orange;">print</span> <span style="color: #6aa84f;">"The text '%s' was found in the subcellular location field of %d proteins."</span> \<br />
% (text, subloc)<br />
<br />
<span style="color: red;"># And close the coroutine and some still open files:</span><br />
counter.close()<br />
handle.close()<br />
<span style="color: orange;">if </span>store == <span style="color: #6aa84f;">"yes"</span>: output.close()<br />
<br />
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-size: large;">When being run in the python shell, the user is requested for certain information and files, and after it the program outputs the number of proteins localized in a particular subcellular location. Let's look how it works.</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
</div>
>>> ====================== RESTART ==========================<br />
>>><br />
<span style="color: blue;">Path and name of the keylist file: </span>C:\Users\Eduardo\Downloads\NU_keylist.txt<br />
<span style="color: blue;">Do you wanna store the IDs of those proteins without subcellular location annotation? Answer yes/no: </span>no<br />
<span style="color: blue;">Enter the text you want to look for in the subcellular location field: </span>nucleus<br />
<span style="color: blue;">The text 'nucleus' was found in the subcellular location field of 1159 proteins.</span><br />
>>><br />
<br />
<span style="font-size: large;">As you can see, first the program ask for the path and name of the keylist file. In my case my kelist file is 'NU_keylist.txt' and it's located in 'C:\Users\Eduardo\Downloads'. The answer of the second question depends on the interest of the user; some proteins are poorly annotated and they don't even have subcellular location information. However, if, for instance, the keylist file of the researcher contains proteins that were found in an experimental nuclear fraction, then those poorly annotated hits may be potential nuclear proteins. Let's continue with the code. If the given answer is 'yes', then a path is asked for storing an output file that will contain those potential new candidates. The last question is the most important, and for being completely helpful the user must know exactly the type of phrases that UniProtKB uses to annotated the subcellular location information of a protein. In other words, the parental and child terms (e.g., nucleus, nucleus inner membrane, nuclear pore complex, nucleolus, etc). Finally, the program prints the number of proteins that contain the given text in their subcellular location field. In my example, 1159 proteins of my keylist file reside in the nucleus.</span><br />
<span style="font-size: large;"><br /></span><br />
<span style="font-size: large;">That's all for this post. See you next time!</span>Mannheimiahttp://www.blogger.com/profile/16733590560025537189noreply@blogger.com0tag:blogger.com,1999:blog-2548012305927658649.post-19863598615405304862012-05-05T20:17:00.001+03:002012-06-03T15:28:23.783+03:00Agilent microarray data analysis in R<div style="text-align: justify;">
<span style="font-size: large;">A month ago I started a collaboration with some researchers of my home university in Peru. They are doing some experiments with <i>Aspergillus niger</i> to observe its gene expression under different substrates and types of cultures. They are evaluating two substrates: lactose and xylose; and two types of culture: biofilms and submerged culture. Thus their experimental flasks were labeled something like this: CBL, CSL, CBX and CSX from their Spanish abbreviation: C for culture, B for biofilm, S for submerged, L for lactose and X for xylose.</span></div>
<br />
<div style="text-align: justify;">
<span style="font-size: large;">As you may know already, R is a free statistical software that, among many other packages, has microarray packages developed by the Bioconductor group and collaborators. For the analysis of Agilent data that has been obtained from the proprietary <i>Agilent Feature Extraction</i> (AFE) image analysis algorithm, readers can refer to <a href="http://www.biomedcentral.com/1471-2164/12/64">http://www.biomedcentral.com/1471-2164/12/64</a> and make use of the <i>AgiMicroRna</i> library in R. However, in this post I deal with Agilent data in a more general manner. It means that the code that will be presented a few lines below can be used for data obtained from AFE, but also for data that is not. Let's see!</span></div>
<div style="text-align: justify;">
<span style="font-size: large;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-size: large;">First of all, we type in the working directory and upload the limma package:</span></div>
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"># Set the working directory or path where your data is localized.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> setwd("C:/...")</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> library(limma)</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<div style="text-align: justify;">
<span style="font-size: large;">At this point I have to make an interruption to say that the analysis of Agilent data requires a so called <b>"target file"</b>, which is just a <b>tab delimited text file</b> created by the user, and containing the experimental desing. Have a look at the one for my <i>A. niger</i> experiment:</span></div>
<div style="text-align: justify;">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: 'Courier New', Courier, monospace;">SampleNumber </span><span class="Apple-tab-span" style="font-family: 'Courier New', Courier, monospace; white-space: pre;"> </span><span style="font-family: 'Courier New', Courier, monospace;">FileName </span><span style="font-family: 'Courier New', Courier, monospace;">Condition</span></div>
<span style="font-family: 'Courier New', Courier, monospace;">1<span class="Apple-tab-span" style="white-space: pre;"> </span> cbl1.txt CBL</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">2<span class="Apple-tab-span" style="white-space: pre;"> </span> cbl2.txt CBL</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">3<span class="Apple-tab-span" style="white-space: pre;"> </span> cbl3.txt CBL</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">4<span class="Apple-tab-span" style="white-space: pre;"> </span> cbx1.txt CBX</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">4<span class="Apple-tab-span" style="white-space: pre;"> </span> cbx2.txt CBX</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">6<span class="Apple-tab-span" style="white-space: pre;"> </span> cbx3.txt CBX</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">7<span class="Apple-tab-span" style="white-space: pre;"> </span> csl1.txt CSL</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">8<span class="Apple-tab-span" style="white-space: pre;"> </span> csl2.txt CSL</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">9<span class="Apple-tab-span" style="white-space: pre;"> </span> csl3.txt CSL</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">10<span class="Apple-tab-span" style="white-space: pre;"> </span> csx1.txt CSX</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">11<span class="Apple-tab-span" style="white-space: pre;"> </span> csx2.txt CSX</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">12<span class="Apple-tab-span" style="white-space: pre;"> </span> csx3.txt CSX</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<div style="text-align: justify;">
<span style="font-size: large;">As you can see, in my experiment, I have 12 samples representing 4 conditions or treatments (i.e., CBL, CBX, CSL and CSX), and each condition has 3 replicates whose intensity signals are store in different files output by the scanner or any other Agilent image analyser. I have simply called my target file "target.txt" and store it in the current working directory. So now we can continue with the R code:</span></div>
<div style="text-align: justify;">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: 'Courier New', Courier, monospace;">> targets = readTargets("targets.txt")</span></div>
<br />
<span style="font-family: 'Courier New', Courier, monospace;">> raw = read.maimages(targets, source="agilent",green.only=TRUE)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"># Subtract the background signal.</span>
<br />
<span style="font-family: 'Courier New', Courier, monospace;">> raw_BGcorrected = backgroundCorrect(raw, method="normexp", offset=16)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"># Then normalize and log-transformed the data.</span>
<br />
<span style="font-family: 'Courier New', Courier, monospace;">> raw_BGandNormalized = normalizeBetweenArrays(raw_BGcorrected,method="quantile")</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"># Finally calculate the average intensity values from the probes of each gene. </span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> raw_aver = avereps(raw_BGandNormalized,ID=raw_BGandNormalized$genes$ProbeName)</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<div style="text-align: justify;">
<span style="font-size: large;">We can also asses the quality of the data before and after normalization. For that, we can use boxplots or MA plots:</span></div>
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"># Before normalization.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> boxplot(log(as.matrix(raw_BGcorrected)),las=2,ylab="Log2(Intensity)")</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"># After it.</span><span style="font-family: 'Courier New', Courier, monospace;"> </span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> boxplot(as.matrix(raw_BGandNormalized), las=2, ylab = "Log2(Intensity)")</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"># Now some MA plots of only one replicate per condition:</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> library(affy)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> mva.pairs(as.matrix(raw)[,c(1,4,7,10)]) # Before BG correction</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> mva.pairs(as.matrix(raw_BGcorrected)[,c(1,4,7,10)]) # Before normalization. </span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> mva.pairs(as.matrix(raw_BGandNormalized)[,c(1,4,7,10)]) # After it.</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<div style="text-align: justify;">
<span style="font-size: large;">After that we can proceed with the differential expression analysis. For that we need to create a design matrix and a contrast matrix. For the former, the user can use the information already store in the target file:</span></div>
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> f = factor(targets$Condition)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> design = model.matrix(~f)</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<div style="text-align: justify;">
<span style="font-size: large;">Or to make this post more didactic erase the above two lines and let's create the design matrix in a different way:</span></div>
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> design = cbind(CBL = c(1,1,1,0,0,0,0,0,0,0,0,0), # First 3 replicates -> CBL</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> CBX = c(0,0,0,1,1,1,0,0,0,0,0,0), # The following 3 -> CBX</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> CSL = c(0,0,0,0,0,0,1,1,1,0,0,0), # The following 3 -> CSL</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> CSX = c(0,0,0,0,0,0,0,0,0,1,1,1)) # Last 3 replicates -> CSX</span><br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;"># Now fit the average intensity data to the design matrix: </span>
<br />
<span style="font-family: 'Courier New', Courier, monospace;">> fit = lmFit(raw_aver, design)</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<div style="text-align: justify;">
<span style="font-size: large;">Then we create the constrast matrix for each comparison we are interested in and asses for the differences in expression using a t-test.</span> </div>
<br />
<span style="font-family: 'Courier New', Courier, monospace;">> contrastMatrix = makeContrasts("CBL-CSL","CBX-CSX","CBL-CBX","CSL-CSX", </span><span style="font-family: 'Courier New', Courier, monospace;">levels=design)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> fit2 = contrasts.fit(fit, contrastMatrix)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> fit2 = eBayes(fit2)</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<div style="text-align: justify;">
<span style="font-size: large;">Finally we might want to see the top ten significant hits in each comparison, or observe the highly significant or all the significant hits, or we may want to store the upregulated and downregulated hits in different files. For that we just need to type in the final lines:</span></div>
<div style="text-align: justify;">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: 'Courier New', Courier, monospace;"># Now let's look for the top ten signficants hits in each comparison:</span></div>
<span style="font-family: 'Courier New', Courier, monospace;">> topTable(fit2, coef = "CBL-CSL") # The ten top significants in CBL vs CSL.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> topTable(fit2, coef = "CBX-CSX") # The ten top significants in CBX vs CSX.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> topTable(fit2, coef = "CBL-CBX") # The ten top significants in CBL vs CBX.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> topTable(fit2, coef = "CSL-CSX") # The ten top significants in CSL vs CSX.</span><br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;"># The significant hits based on adjusted p-value for the comparison CBL vs CSL: </span><span style="font-family: 'Courier New', Courier, monospace;"> </span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> sig = length(which(topTable(fit2, coef = "CBL-CSL",number=42370)[,15]<0.05)) </span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> signif = topTable(fit2, coef = "CBL-CSL",number=sig)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> upregulated = signif[which(signif[,11]>0),] # The upregulated hits in CBL.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> downregulated = signif[which(signif[,11]<0),] # The downregulated ones.</span><br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;"># Save them in different files for future annotation or functional cluster</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"># analysis:</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> write.table(upregulated, "CBLvsCSL_Upre.txt", sep="\t")</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">> write.table(downregulated, "CBLvsCSL_Downre.txt", sep="\t")</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<div style="text-align: justify;">
<span style="font-size: large;">To end this post I just want to make some notes. Firstly, the parameters of the topTable function can be studied by just typing in </span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">> help(topTable)</span><span style="font-size: large;">. For my A. niger data, I have used </span><span style="font-size: x-small;"><span style="font-family: 'Courier New', Courier, monospace;">number=42370</span><span style="font-family: 'Courier New', Courier, monospace;"> </span></span><span style="font-size: large;">because the number of probes in my microarrays are less but close to that number. Secondly, the indexes </span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">[,15]</span><span style="font-size: large;"> and </span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">[,11]</span><span style="font-size: large;"> may change for the user's data depending on the column localization of the adjusted p-value and the fold change, respectively, in the user's own data frame of significant hits. Finally, with some more scripts the user can also make heatmaps and cluster analyses. That might be a topic for another post in this blog. Till next time! </span></div>
<br />Mannheimiahttp://www.blogger.com/profile/16733590560025537189noreply@blogger.com1tag:blogger.com,1999:blog-2548012305927658649.post-49469487471734019872012-04-23T05:17:00.000+03:002012-05-20T22:28:37.822+03:00A UniProt KeyList file parser in Python<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">In this first post, I'd like to share a parser that I wrote some months ago when working with proteomics data.</span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Usually a researcher wants to know what a protein does and therefore goes to UniProtKB and searches for it using its AC number (e.g., Q8NFH3) or its ID (e.g, NUP43_HUMAN) and reads the available information that goes from the protein name, its description, function, sequence and links to other databases. Others the researcher wants to do a batch search and so uses the "Retrieve" tool of UniProtKB that outputs different types of file (e.g., FASTA, GFF, XML, Flat Text, etc.) with different information each. The Flat Text file is the most informative one since it contains all the information displayed in the interface of a basic UniProtKB search. A single record of this file looks like this:</span></div>
<br />
<br />
<pre style="white-space: pre-wrap; word-wrap: break-word;">ID RBM47_HUMAN Reviewed; 593 AA.
AC A0AV96; A0PJK2; B5MED4; Q8NI52; Q8NI53; Q9NXG3;
DT 23-OCT-2007, integrated into UniProtKB/Swiss-Prot.
DT 30-NOV-2010, sequence version 2.
DT 18-APR-2012, entry version 46.
DE RecName: Full=RNA-binding protein 47;
DE AltName: Full=RNA-binding motif protein 47;
GN Name=RBM47; </pre>
<span style="white-space: pre-wrap;">:</span><br />
:<br />
<pre style="white-space: pre-wrap; word-wrap: break-word;">CC -!- SUBCELLULAR LOCATION: Nucleus (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;
CC Name=1;
CC IsoId=A0AV96-1; Sequence=Displayed;
CC Name=2;
CC IsoId=A0AV96-2; Sequence=VSP_028839;
CC -!- SIMILARITY: Belongs to the RRM RBM47 family.
CC -!- SIMILARITY: Contains 3 RRM (RNA recognition motif) domains. </pre>
:<br />
:<br />
:<br />
<pre style="white-space: pre-wrap; word-wrap: break-word;">SQ SEQUENCE 593 AA; 64099 MW; AEA061F89A68010B CRC64;
MTAEDSTAAM SSDSAAGSSA KVPEGVAGAP NEAALLALME RTGYSMVQEN GQRKYGGPPP
GWEGPHPQRG CEVFVGKIPR DVYEDELVPV FEAVGRIYEL RLMMDFDGKN RGYAFVMYCH
:</pre>
<pre style="white-space: pre-wrap; word-wrap: break-word;"> :</pre>
<pre style="white-space: pre-wrap; word-wrap: break-word;"> KVPEGVAGAP</pre>
<pre style="white-space: pre-wrap; word-wrap: break-word;">//</pre>
<br />
<br />
<br />
<div style="text-align: justify;">
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">As seen above, each record has a series of keywords (e.g., <span style="white-space: pre-wrap;">ID, AC, DE, CC, SQ, etc.) that store particular type of information of the protein. Because of these keywords, this type of file is commonly known as KeyList file, and thanks to them the file is easy to parse and so extract information record-wise and mine it. For example one can have a KeyList file with thousands of records and wants to extract their descriptions or all the accession numbers associated with each ID or, even more important, information like the functions </span><span style="white-space: pre-wrap;">or subcellular locations </span><span style="white-space: pre-wrap;">of the records. The code I wrote for parsing through this kind of file is the following:</span></span></div>
</div>
<div style="text-align: justify;">
<span style="white-space: pre-wrap;"><br /></span></div>
<span style="white-space: pre-wrap;"><br /></span><br />
<span style="white-space: pre-wrap;"><span style="color: orange;">class</span> <span style="color: blue;">Record</span>(<span style="color: purple;">dict</span>):
<span style="color: #38761d;"> """
This record stores the information of one keyword or category in the
keywlist.txt as a Python dictionary. The keys in this dictionary are
the line codes that can appear in the keywlist.txt file:
--------- --------------------------- ----------------------
Line code Content Occurrence in an entry
--------- --------------------------- ----------------------
ID Identifier (keyword) Once; starts a keyword entry.
AC Accession (KW-xxxx) Once.
DE Definition Once or more.
CC Subcellular Location Once or more; comments.
SQ Sequence Once; contains only the heading information.
"""</span>
<span style="color: orange;">def</span> <span style="color: blue;">__init__</span>(self):
<span style="color: purple;">dict.</span>__init__(self)
<span style="color: orange;">for</span> keyword <span style="color: orange;">in </span>(<span style="color: #38761d;">"DE"</span>,<span style="color: #38761d;"> "CC"</span>):
self[keyword] = []
<span style="color: orange;">def </span><span style="color: blue;">parse</span>(handle): <span style="color: red;"># The parameter handle is the UniProt KeyList file.</span>
record = Record()
# Now parse the records
<span style="color: orange;"> for</span> line <span style="color: orange;">in</span> handle:
key = line[:2]
<span style="color: orange;">if</span> key==<span style="color: #38761d;">"//"</span>: <span style="color: red;"># The last line of the current record has been reached.</span>
record[<span style="color: #38761d;">"DE"</span>] = " ".join(record[<span style="color: #38761d;">"DE"]</span>)
record[<span style="color: #38761d;">"CC"</span>] = " ".join(record[<span style="color: #38761d;">"CC"</span>])
<span style="color: orange;">yield</span> record <span style="color: red;"># So we output the record and pass to other one. </span>
record = Record()
<span style="color: orange;">elif</span> line[2:5]==<span style="color: #38761d;">" "</span>: <span style="color: red;"># If not, we continue recruiting the information.</span>
value = line[5:].strip()
<span style="color: orange;">if</span> key <span style="color: orange;">in</span> (<span style="color: #38761d;">"ID"</span>, <span style="color: #38761d;">"AC"</span>,<span style="color: #38761d;"> "SQ"</span>):
record[key] = value
<span style="color: orange;">elif</span> key <span style="color: orange;">in</span> (<span style="color: #38761d;">"DE"</span>, <span style="color: #38761d;">"CC"</span>):
record[key].append(value)
<span style="color: orange;">else</span>:
<span style="color: orange;">pass</span>
<span style="color: red;"># Read the footer and throw it away</span>
<span style="color: orange;">for</span> line <span style="color: orange;">in</span> handle:
<span style="color: orange;">pass</span></span><br />
<span style="white-space: pre-wrap;"><span style="color: orange;"><br /></span></span><br />
<span style="white-space: pre-wrap;"><span style="color: orange;"><br /></span></span><br />
<div style="text-align: justify;">
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large; white-space: pre-wrap;">You can copy this script and save it to a python file called for instance <b>UniProt_parser.py</b> and then use it as a module in the current python shell or any other new script using the <span style="color: orange;">import</span> tool. Something like this:</span></div>
</div>
<span style="white-space: pre-wrap;"> </span><br />
<span style="white-space: pre-wrap;"><br /></span><br />
<span style="white-space: pre-wrap;"><span style="color: orange;">from</span> UniProt_parser <span style="color: orange;">import</span> *
handle = <span style="color: purple;">open</span>(<span style="color: #38761d;">"Name of the UniProt keylist file"</span>)
records = parse(handle) <span style="color: red;"># Uses the function 'parse' from the module.</span> </span><br />
<span style="white-space: pre-wrap;"><span style="color: orange;">for</span> record <span style="color: orange;">in</span> records:</span><br />
<span style="white-space: pre-wrap;"> <span style="color: purple;">print</span> record[<span style="color: #38761d;">"ID"</span>]
<span style="color: purple;">print</span> record[<span style="color: #38761d;">"AC"</span>]
<span style="color: purple;">print</span> record[<span style="color: #38761d;">"CC"</span>]
<span style="color: purple;">print</span> record[<span style="color: #38761d;">"SQ"</span>]</span><br />
<span style="white-space: pre-wrap;"><br /></span><br />
<span style="white-space: pre-wrap;"><br /></span><br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large; white-space: pre-wrap;">With a bit more of scripting lines, the parser can be use for mining the information, for example to know how many proteins have subcellular location in the membrane, nucleus, mitochondrion, etc. Or retrieve the molecular weigth and/or sequence length of the protein and store them in a file.</span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large; white-space: pre-wrap;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-size: large; white-space: pre-wrap;"><span style="font-family: Arial, Helvetica, sans-serif;">That's it for this post. I hope it was useful. Till next time! </span> </span></div>
<br />Mannheimiahttp://www.blogger.com/profile/16733590560025537189noreply@blogger.com0