Reproducibility in high-throughput and computational biology:

Just found out about the Potti scandal at Duke (a primer for those who have never heard of it: http://en.wikipedia.org/wiki/Anil_Potti).

Currently watching http://videolectures.net/cancerbioinformatics2010_baggerly_irrh/. Some of the extraordinary quotes (approximate, though):

If, after a computational analysis, you give a biologist a single gene, unrelated to cancer until now, that correlates with an increased risk of cancer, you will most likely hear something like “No, you’ve got stroma contamination over here: I’ve been studying this gene for years now and I know perfectly well that it is completely uncorrelated with cancer”.

If, after a computational analysis, you give a biologist a list of hundreds of genes and say “here is the genetic signature of cancer”, he will most likely just agree with you, because “yeah, this one seems to correlate with that one, so yeah, that makes sense”.

=> This is precisely why I am developing the information flow framework for drug discovery and clinical biology: to make biological sense of lists of hundreds of perturbed genes.

Forensic bioinformatics: here are the raw data, here are the final results. Let’s try to figure out how to get from the raw data to the results, disregarding what the authors said they did in the supplementary data.

=> Idea: test the chemotherapeutic drug against the 60 cell line panel to determine its specificity, and see if it correlates with the biological knowledge we have about those cell lines.

Let’s use metagenes!!! As mathematicians, we know them as PCA, but well, let’s call them metagenes.
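For readers who have not met the term, here is one way to write the link down. This is only a sketch under my own assumption that a “metagene” means a principal component of the centered expression matrix; the symbols X, U, Σ, V, n and p are my notation, not something taken from the talk.

```latex
% X: expression matrix with n samples (rows) and p genes (columns),
% each column centered to zero mean.
X = U \Sigma V^{\top}, \qquad X \in \mathbb{R}^{n \times p}

% The k-th metagene is the score of every sample on the k-th principal axis:
m_k = X v_k = \sigma_k u_k

% v_k (k-th column of V) holds the weight of each gene in the metagene,
% \sigma_k is the k-th singular value, u_k the k-th column of U.
```

In other words, a metagene is a weighted sum of genes, with the weights chosen so that the first few metagenes capture as much of the variance as possible.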

Their list and ours: you might see the pattern. Yes, the gene IDs are offset by 1.
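This kind of indexing slip is easy to test for once you suspect it. Below is a minimal sketch on toy data: the gene names, the lists and the `best_offset` helper are all made up for illustration, and this is not the procedure described in the talk. The idea is to shift every reported gene by a few positions along the array annotation and count how often it lands on a gene from the independently derived list.

```python
def best_offset(reference, reported, expected, max_shift=3):
    """Find the index shift that best maps `reported` onto `expected`.

    reference: ordered list of gene IDs, as they appear on the array annotation
    reported:  gene IDs published by the other group
    expected:  gene IDs obtained when redoing the analysis independently
    """
    position = {gene: i for i, gene in enumerate(reference)}
    expected_set = set(expected)
    scores = {}
    for shift in range(-max_shift, max_shift + 1):
        hits = 0
        for gene in reported:
            i = position.get(gene)
            if i is None:
                continue  # gene not on the array annotation at all
            j = i + shift
            if 0 <= j < len(reference) and reference[j] in expected_set:
                hits += 1
        scores[shift] = hits
    # Prefer the highest hit count; on ties, prefer the smallest shift.
    best = max(scores, key=lambda s: (scores[s], -abs(s)))
    return best, scores


# Toy data (hypothetical): ten genes in annotation order.
reference = ["g%d" % i for i in range(10)]
expected = ["g2", "g5", "g7"]   # what the re-analysis gives
reported = ["g1", "g4", "g6"]   # what was published

shift, scores = best_offset(reference, reported, expected)
print(shift, scores[shift])
```

If a non-zero shift recovers nearly all of the expected genes while a shift of zero recovers almost none, an off-by-one error in the row indexing is a far more plausible explanation than biology.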

So, we had a look at the software they were using and its documentation. If you want to read the docs, go to my website, because I was the one who wrote them, since there were none!

Most review committees at biological journals are made up of biologists: they will skip all the parts related to the microarray analysis, jump to the results and check whether the computational biology results agree with the wet lab results.

 

Murky waters of systems biology

I am currently trying to parse the Reactome.org OWL database file into a format better suited to my needs. So far I have run into some major difficulties because of a lack of rigor in the organisation of classes in this resource, at least in the BioPAX .owl export file.

First, the obscure use of the “memberPhysicalEntity” attribute. Some of the proteins, complexes and smallMolecules are in fact whole classes of proteins, whose functions are often undefined even though they are mentioned in the reactions. This means I have to track them down by hand and use proxy objects (which are not real groups of proteins).
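To give an idea of what resolving these groups involves, here is a minimal sketch. It assumes the memberPhysicalEntity links have already been extracted from the .owl file into a plain dictionary; the entity names and the `expand_members` helper are hypothetical, not part of Reactome or of any BioPAX library.

```python
def expand_members(entity, members):
    """Return the set of concrete entities hidden behind a generic one.

    members maps an entity ID to the list of IDs it points to through
    memberPhysicalEntity. An entity with no members is taken to be concrete.
    """
    concrete, seen, stack = set(), set(), [entity]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles and repeated members
        seen.add(current)
        children = members.get(current, [])
        if children:
            stack.extend(children)
        else:
            concrete.add(current)
    return concrete


# Toy data (hypothetical IDs): a generic "kinase family" nested in two levels.
members = {
    "KinaseFamily": ["KinaseSubfamilyA", "Protein3"],
    "KinaseSubfamilyA": ["Protein1", "Protein2"],
}

print(sorted(expand_members("KinaseFamily", members)))
print(sorted(expand_members("Protein1", members)))
```

Whether a group should be flattened like this or kept as a proxy object depends on what the downstream analysis needs, so the sketch only covers the flattening half of the problem.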

Second, a mixture of:

  • Physical entities defined by a unique structure, for instance proteins as defined by a UniProt entry
  • Instantiated physical entities: the ones that carry a post-translational modification or are localized to a particular cellular compartment
  • Fragments of physical entities, for instance alpha chains of different proteins
  • Collections of fragments of physical entities, such as the collection of all the alpha chains found within the database

Third, some redundancy in the use of OWL terms. For instance, Catalysis, Control, TemplateReactionRegulation and Degradation are all used in a similar fashion, even though Catalysis is a modulation of the kinetic barrier of a reaction while the others can actually perturb the reaction completely. What is the reason for this redundancy of terms? It is not very clear…

Fourth, a lack of information essential to a class in the class description, aka the “Headless Horseman” problem:

  • Post-translational modifications on a protein, provided without a location (i.e. somewhere on the protein) and without a type (i.e. some modification).
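A check for such headless entries can be sketched as follows. It assumes the modification records have already been pulled out of the export into plain dictionaries; the field names `protein`, `modification_type` and `location` are my own hypothetical choices, not the names used in the .owl file.

```python
def headless_modifications(features):
    """List the modifications that lack a type, a location, or both."""
    report = []
    for feature in features:
        missing = [
            field
            for field in ("modification_type", "location")
            if feature.get(field) is None
        ]
        if missing:
            report.append((feature["protein"], missing))
    return report


# Toy records (hypothetical): one complete, two headless.
features = [
    {"protein": "P_A", "modification_type": "phosphorylation", "location": 15},
    {"protein": "P_B", "modification_type": None, "location": None},
    {"protein": "P_C", "modification_type": "acetylation"},
]

for protein, missing in headless_modifications(features):
    print(protein, "is missing:", ", ".join(missing))
```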

Last, but not least, lots of “floating” compounds. There are about 6000 compounds (Proteins, Complexes or PhysicalEntities) that are referenced by exactly one reaction. This means they take part in no other reaction and regulate no other reaction, which seems quite unrealistic to me.
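The count above boils down to tallying how many reactions mention each entity. Here is a minimal sketch on toy data, assuming the reactions have already been reduced to participants and controllers; the dictionary layout and the `floating_entities` helper are hypothetical, not the actual structure of the Reactome export.

```python
from collections import Counter


def floating_entities(reactions):
    """Return the entities that are mentioned by exactly one reaction.

    reactions maps a reaction ID to a dict with two optional keys:
    "participants" (inputs and outputs) and "controllers" (regulators).
    """
    usage = Counter()
    for reaction in reactions.values():
        involved = set(reaction.get("participants", ()))
        involved |= set(reaction.get("controllers", ()))
        usage.update(involved)  # each entity counted once per reaction
    return sorted(entity for entity, count in usage.items() if count == 1)


# Toy data (hypothetical IDs).
reactions = {
    "R1": {"participants": ["A", "B"], "controllers": ["E1"]},
    "R2": {"participants": ["B", "C"], "controllers": ["E1"]},
    "R3": {"participants": ["C", "D"]},
}

print(floating_entities(reactions))
```

Counting each entity once per reaction matters here: an entity that is both a substrate and a regulator of the same reaction is still only attached to the network at a single point.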

I have now spent about two weeks getting it all in order, and I’ll try to publish the resulting cleaned file once I am done.