Reproducibility in high-throughput and computational biology:

Just found out about the Potti scandal at Duke (a primer for those who have never heard of it: http://en.wikipedia.org/wiki/Anil_Potti).

Currently watching http://videolectures.net/cancerbioinformatics2010_baggerly_irrh/. Some of the extraordinary quotes (approximate, though):

If, after a computational analysis, you give a biologist a single gene, so far unrelated to cancer, that correlates with an increased risk of cancer, you will most likely hear something like “No, you’ve got stroma contamination over here: I’ve been studying this gene for years now and I know perfectly well that it is completely uncorrelated with cancer”.

If, after a computational analysis, you give a biologist a list of hundreds of genes and say “here is the genetic signature of cancer”, he will most likely just agree with you, because “yeah, this one seems to correlate with that one, so yeah, that makes sense”.

=> This is precisely why I am developing the information flow framework for drug discovery and clinical biology: to make biological sense of lists of hundreds of perturbed genes.

Forensic bioinformatics: here is the raw data, here are the final results. Let’s try to figure out how to get from the raw data to the results, disregarding what they said they did in the supplementary data.

=> Idea: use the panel of 60 cell lines tested against the chemotherapeutic drug to determine specificity, and see whether it correlates with what we know biologically about those cell lines.

Let’s use metagenes!!! As mathematicians, we know them as PCA, but well, let’s call them metagenes.

Their list and ours: you might see the pattern. Yes, the gene IDs are offset by 1.
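To make the off-by-one point concrete for myself, here is a minimal sketch of how such a shift could be checked. It assumes the gene IDs are integer row indices and that both lists are in the same order; the function name `find_id_offset` and the IDs below are made up for illustration and are not taken from the talk.

```python
def find_id_offset(reported_ids, reproduced_ids, max_shift=3):
    """Return the constant shift s with reported == reproduced + s, or None.

    Assumes both lists hold integer row indices in the same order.
    """
    if len(reported_ids) != len(reproduced_ids):
        return None
    for shift in range(-max_shift, max_shift + 1):
        if all(r == p + shift for r, p in zip(reported_ids, reproduced_ids)):
            return shift
    return None


# Made-up row indices, for illustration only.
reported = [101, 205, 330, 412]
reproduced = [100, 204, 329, 411]

offset = find_id_offset(reported, reproduced)
print("constant offset between the two lists:", offset)
```

A non-zero result would suggest an indexing slip (for instance a header row counted on one side only) rather than a genuinely different gene list.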

So, we had a look at the software they were using and its documentation. If you want to read the docs, go to my website, because it was me who wrote them, since there were none!

Most review committees at biological journals are made up of biologists; they will skip all the parts related to the microarray analysis, jump to the results and check whether the computational biology results agree with the wet lab results.

 

Mastering Groovy

So, since I want to work with neo4j through bulbs, it seems that I have no other option but to use Groovy Gremlin.

Installation of Groovy in Eclipse: through the Marketplace. Quite easy.

First attempt to use it: install Gremlin from TinkerPop and access it from the Groovy shell in Eclipse. After about an hour of furious googling, it seems that a couple of libraries need to be included on the Groovy shell classpath to launch Gremlin from within Groovy:

gremlin$ groovysh -cp $GREMLIN_HOME/lib/gremlin-groovy-2.3.0.jar:$GREMLIN_HOME/lib/gremlin-java-2.3.0.jar:$GREMLIN_HOME/lib/pipes-2.3.0.jar:$GREMLIN_HOME/lib/common-1.7.jar:$GREMLIN_HOME/lib/groovy-1.8.9.jar

To do the same thing from Eclipse, go to Project > Properties > Java Build Path > Add External JARs and add:

  • $GREMLIN_HOME/lib/gremlin-groovy-2.3.0.jar
  • $GREMLIN_HOME/lib/gremlin-java-2.3.0.jar
  • $GREMLIN_HOME/lib/pipes-2.3.0.jar
  • $GREMLIN_HOME/lib/common-1.7.jar
  • $GREMLIN_HOME/lib/groovy-1.8.9.jar

Murky waters of systems biology

    I am currently trying to parse the Reactome.org OWL database file into a format better suited to my needs. So far I have run into some major difficulties because of a lack of rigor in how classes are organised in this resource, at least in the BioPAX .owl export file.

    First, an obscure use of the “memberPhysicalEntity” attribute. Some of the proteins, complexes and smallMolecules are in fact whole classes of proteins, often with undefined functions, and they are mentioned as such in the reactions. This means I have to track them down by hand and use proxy objects (which are not real groups of proteins).
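A minimal sketch of how such generic entities could be unrolled into their concrete members, assuming the “memberPhysicalEntity” links have already been extracted into a plain dictionary. The function name `expand_members` and all the identifiers below are hypothetical and do not come from the Reactome export.

```python
def expand_members(entity_id, members_of, _seen=None):
    """Return the concrete entities behind a (possibly generic) entity.

    members_of maps an entity id to the ids listed under its
    memberPhysicalEntity attribute; entities without members are concrete.
    """
    seen = set() if _seen is None else _seen
    if entity_id in seen:
        return set()  # already visited: guards against cyclic definitions
    seen.add(entity_id)

    children = members_of.get(entity_id, [])
    if not children:
        return {entity_id}

    concrete = set()
    for child in children:
        concrete |= expand_members(child, members_of, seen)
    return concrete


# Hypothetical ids, for illustration only.
members_of = {
    "Protein_set_1": ["Protein_A", "Protein_set_2"],
    "Protein_set_2": ["Protein_B", "Protein_C"],
}

print(sorted(expand_members("Protein_set_1", members_of)))
```

This only handles the bookkeeping; deciding which generic entities deserve a proxy object still has to be done by hand.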

    Second, a mixture of:

    • Physical entities defined by a unique structure, for instance proteins as defined by a UniProt entry
    • Instantiated physical entities: the ones that carry a post-translational modification or are localized to a particular cellular compartment
    • Fragments of physical entities, for instance alpha chains of different proteins
    • Collections of fragments of physical entities, such as the collection of all the alpha chains found within the database

    Third, some redundancy in the use of OWL terms. For instance, Catalysis, Control, TemplateReactionRegulation and Degradation are all used in a similar fashion, even though Catalysis modulates the kinetic barrier of a reaction while the others can completely perturb the reaction. What is the reason for this redundancy of terms? It is not very clear…

    Fourth, a lack of information essential to a class in the class description, aka the “Headless Horseman” problem:

    • Post-translational modifications on a protein, provided without a location (i.e. somewhere) and without a type (some modification).
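A rough sketch of how such “headless” modifications could be flagged, assuming each modification has already been parsed into a dictionary. The field names `id`, `location` and `type` are my own placeholders, not BioPAX property names, and the records are invented.

```python
def find_headless_modifications(modifications, required=("location", "type")):
    """Map each modification id to the list of required fields it lacks."""
    headless = {}
    for mod in modifications:
        missing = [field for field in required if mod.get(field) is None]
        if missing:
            headless[mod["id"]] = missing
    return headless


# Invented records, for illustration only.
modifications = [
    {"id": "mod1", "location": 123, "type": "phosphorylation"},
    {"id": "mod2", "location": None, "type": None},
    {"id": "mod3", "location": 45, "type": None},
]

print(find_headless_modifications(modifications))
```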

    Last but not least, lots of “floating” compounds. There are about 6000 compounds (Proteins, Complexes or PhysicalEntities) that are referenced by exactly one reaction. This means they participate in no other reaction and regulate no other reaction. That seems quite unrealistic to me.
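The count of “floating” compounds can be sketched as follows, assuming the reactions have already been reduced to a mapping from reaction id to the entities it references (participants and regulators alike). The function name `find_floating_entities` and the toy data are hypothetical.

```python
from collections import Counter


def find_floating_entities(reactions):
    """Return the sorted ids of entities referenced by exactly one reaction.

    reactions maps a reaction id to the entity ids it references.
    """
    usage = Counter()
    for entity_ids in reactions.values():
        usage.update(set(entity_ids))  # count each entity once per reaction
    return sorted(entity for entity, count in usage.items() if count == 1)


# Toy data, for illustration only.
reactions = {
    "R1": ["ATP", "Kinase_X", "Substrate_Y"],
    "R2": ["ATP", "Substrate_Y", "Product_Z"],
    "R3": ["ATP", "Product_Z"],
}

floating = find_floating_entities(reactions)
print(floating)
```

In this toy example only Kinase_X is referenced by a single reaction, so it is the only entity reported as floating.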

    I have now spent about two weeks getting it all in order, and I’ll try to publish the resulting cleaned file once I am done.

    Python – D3.js integration

    Python provides a pretty common framework for biological data analysis, and D3.js is one of the most common platforms for visualising large amounts of data. So I have been looking for a way to make them work together. This post gives a pretty decent introduction to D3.js visualisation for people totally unfamiliar with JavaScript. It also suggests a possible way of interfacing D3.js with Python via a JavaScript pseudo-library, holding the data in JSON format, that is written to the root folder containing the HTML page.
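My reading of the “pseudo-library” idea is a small .js file that declares one global variable holding the JSON, which the HTML page then loads with an ordinary script tag. A minimal sketch under that assumption; the function name `write_js_data`, the file name `data.js` and the variable name `graph` are all made up, and a temporary folder stands in for the folder that holds the HTML page.

```python
import json
import os
import tempfile


def write_js_data(data, path, var_name="data"):
    """Write data as a JS file that declares one global variable."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("var {} = {};\n".format(var_name, json.dumps(data)))


# Invented graph data, for illustration only.
graph = {
    "nodes": [{"name": "A"}, {"name": "B"}],
    "links": [{"source": 0, "target": 1}],
}

# In practice this would be the folder containing the HTML page.
out_dir = tempfile.mkdtemp()
js_path = os.path.join(out_dir, "data.js")
write_js_data(graph, js_path, var_name="graph")
print("wrote", js_path)
```

The page would then include the generated file with a script tag placed before the D3 script and read the `graph` variable directly.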

    An alternative approach is to send the JSON directly to the web page’s JavaScript via Python’s built-in server, but this requires a little more work. In any case, I will be looking at it in more depth shortly.
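A sketch of what the server side of this alternative might look like, assuming Python 3 and its standard `http.server` module. The helper name `make_json_server` and the `/data.json` path are my own choices, and I have not wired this into an actual D3 page yet.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_json_server(payload, host="127.0.0.1", port=0):
    """Build an HTTP server that answers GET /data.json with payload as JSON.

    port=0 lets the operating system pick a free port.
    """
    body = json.dumps(payload).encode("utf-8")

    class JsonHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/data.json":
                self.send_error(404)
                return
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the console quiet

    return HTTPServer((host, port), JsonHandler)


# Usage (blocks until interrupted):
#   server = make_json_server({"nodes": [], "links": []}, port=8000)
#   server.serve_forever()
```

On the page side, D3 v3 can fetch such a URL with `d3.json(url, callback)`. The page itself would have to be served from the same origin, which this sketch does not do, so it covers only the data half of the problem.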

    http://blog.nextgenetics.net/?e=7

    Update1: I tried to follow the path suggested in the link, but it didn’t quite work for me.

    Update2: And as usual, the problem was not with the tutorial but with the chair-keyboard interface on the user’s side: I put the d3.min.js library in a folder my server had no permission to access, so it didn’t get loaded and the script didn’t get executed. Using the following snippet instead to import the d3.min.js library works perfectly fine:

    <script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>