Mannheimia goes programming: Coroutines in Python: an example applied to Bioinformatics.

Hello again! The first post of this blog dealed with a UniProt keylist parser, where I told you that we could mine information like the one store in the annotation field of each record with some additional scripts. Well this current post deals with it. The code presented at the bottom can actually be simpler, but my aim is to introduce a non very common Python object called coroutine.

Coroutines are similar to functions and generators. However, whereas these operate on a single set of input arguments, coroutines can take a sequence of inputs sent to them. Indeed, coroutines are functions created by using the yield statement (as generators) but written as an expression '(yield)'. Let's see an example of David Beazley showed in the fourth edition of his book Python Essential Reference:

def print_matches(matchtext):

print "Looking for", matchtext

while True:

line = (yield) # Get a line of a text

if matchtext in line:

print line

To use a coroutine, you first call it with the next method, and then start sending arguments into it using the send method.

>>> matcher = print_matches("python")

>>> matcher.next()

Looking for python

>>> matcher.send("Hello World") # Generates no output because the text 'python' is not in the given string

>>> matcher.send("python is cool")

python is cool

>>> matcher.send("Yeah!")

>>> matcher.close() # Done with the matcher function call

Coroutines are useful when writing programs that run in parallel, where one part of a program cosumes the information generated by the other. Like generators, coroutines can be very useful for processing information produced by pipelines, streams or data flow. However in this post I will use it in an introductory way for you to observe how it works. Let's suppose we have a UniProt keylist file from which we want to know how many proteins are annotated as residing in the nucleus, cytoplasm or any other subcellular location. Our parser generator defined in a previous post can extract the annotations stored in the "CC" field of any record. The "CC" field contains information like the function of a protein, tissue specificity and subcellular location.

>>> re["CC"]

'-!- FUNCTION: May act as a transcriptional repressor. Down-regulates CA9 expression. -!- SUBUNIT: Interacts with HDAC4. -!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm, cytosol. Note=Mainly located in the nucleus. -!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q9Y6X9-1; Sequence=Displayed; Note=No experimental confirmation available; Name=2; IsoId=Q9Y6X9-2; Sequence=VSP_041759; Note=No experimental confirmation available; -!- TISSUE SPECIFICITY: Highly expressed in smooth muscle, pancreas and testis. -!- SIMILARITY: Contains 1 CW-type zinc finger. -!- SEQUENCE CAUTION: Sequence=AAC12954.1; Type=Erroneous gene model prediction; Sequence=BAA74875.2; Type=Miscellaneous discrepancy; Note=Intron retention; ----------------------------------------------------------------------- Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms Distributed under the Creative Commons Attribution-NoDerivs License -----------------------------------------------------------------------'

>>>

In the topic of this post, we are interested in the subcellular location of our proteins. The python code to work on that piece of the "CC" field is the following:

''' This program looks for a specific text in the subcellular location field
of the proteins stored in a UniProtKB keylist file. If requested, those
entries with no subcellular location annotation at all are written in a
file.'''

subloc = 0
def subloc_counter(matchtext): # This a coroutine (similar to, but neither a
global subloc # function nor a generator).
while True:
string = (yield) # Incorporates the input to be evaluated.
if matchtext in string:
subloc += 1

# Now let's read the KeyList file using the generator defined in UniProt_parser.py
from UniProt_parser import *
handle = open(raw_input("Path and name of the keylist file: "))
records = parse(handle)

# Proteins with non subcellular location annotation might be interesting.
store = raw_input("Do you wanna store the IDs of those proteins without \
subcellular location annotation? Answer yes/no: ")
if store == "yes":
output = open(raw_input("Where to store the entries without subcell location? "), 'w')

# Now, let's count the matches to an input text:
text = raw_input("Enter the text you want to look for in the subcellular location field: ")
counter = subloc_counter(text)
counter.next() # Preparing for counting

for re in records:
SL = re["CC"].split("-!- ") # The type of output return by the split function is a list.
for item in SL:
if item.startswith("SUBCELLULAR LOCATION"): # If subloc exists, then keep only the
SL = item.lower() # lines related to it. And put it in lowercase so to avoid
break # case-sensitivity.
else:
pass

if store == "yes" and type(SL) == list: # If SL is still a list, it means that non
output.write(re["ID"].split(" ")[0]+'\n') # subloc annotation was found. Then store
# that protein in the output file.
else:
counter.send(SL) # Look for the text in the current SubLoc field.

# After searching in the annotation of all proteins:
print "The text '%s' was found in the subcellular location field of %d proteins." \
% (text, subloc)

# And close the coroutine and some still open files:
counter.close()
handle.close()
if store == "yes": output.close()

When being run in the python shell, the user is requested for certain information and files, and after it the program outputs the number of proteins localized in a particular subcellular location. Let's look how it works.

>>> ====================== RESTART ==========================
>>>
Path and name of the keylist file: C:\Users\Eduardo\Downloads\NU_keylist.txt
Do you wanna store the IDs of those proteins without subcellular location annotation? Answer yes/no: no
Enter the text you want to look for in the subcellular location field: nucleus
The text 'nucleus' was found in the subcellular location field of 1159 proteins.
>>>

As you can see, first the program ask for the path and name of the keylist file. In my case my kelist file is 'NU_keylist.txt' and it's located in 'C:\Users\Eduardo\Downloads'. The answer of the second question depends on the interest of the user; some proteins are poorly annotated and they don't even have subcellular location information. However, if, for instance, the keylist file of the researcher contains proteins that were found in an experimental nuclear fraction, then those poorly annotated hits may be potential nuclear proteins. Let's continue with the code. If the given answer is 'yes', then a path is asked for storing an output file that will contain those potential new candidates. The last question is the most important, and for being completely helpful the user must know exactly the type of phrases that UniProtKB uses to annotated the subcellular location information of a protein. In other words, the parental and child terms (e.g., nucleus, nucleus inner membrane, nuclear pore complex, nucleolus, etc). Finally, the program prints the number of proteins that contain the given text in their subcellular location field. In my example, 1159 proteins of my keylist file reside in the nucleus.

That's all for this post. See you next time!

Mannheimia goes programming

Sunday, May 20, 2012

Coroutines in Python: an example applied to Bioinformatics.

1 comment: