I am currently trying to parse the Reactome.org OWL database file into a format better suited to my needs. So far I have run into some major difficulties because of a lack of rigor in how classes are organised in this resource, at least in the BioPAX .owl export file.
First, the obscure use of the “memberPhysicalEntity” attribute. Some of the proteins, complexes and smallMolecules are in fact whole classes of proteins, often with no defined function, that are nonetheless mentioned in the reactions. This means I have to track them down by hand and use proxy objects (which are not real groups of proteins).
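Finding these generic entities does not have to be done entirely by hand. Below is a minimal sketch using only the Python standard library. It assumes the export is RDF/XML in the BioPAX Level 3 namespace and that links are written as rdf:resource attributes; the sample data and the helper names (`short_id`, `find_generic_entities`) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces (BioPAX Level 3 and RDF); adjust if the export differs.
BP = "{http://www.biopax.org/release/biopax-level3.owl#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Hypothetical miniature export, only for illustration.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bp="http://www.biopax.org/release/biopax-level3.owl#">
  <bp:Protein rdf:ID="Protein1">
    <bp:memberPhysicalEntity rdf:resource="#Protein2"/>
    <bp:memberPhysicalEntity rdf:resource="#Protein3"/>
  </bp:Protein>
  <bp:Protein rdf:ID="Protein2"/>
  <bp:Protein rdf:ID="Protein3"/>
</rdf:RDF>"""


def short_id(value):
    """Keep only the fragment after '#', so '#Protein1' becomes 'Protein1'."""
    return value.rsplit("#", 1)[-1]


def find_generic_entities(root):
    """Map every entity that lists members to the ids of those members."""
    generic = {}
    for elem in root:  # top-level individuals under rdf:RDF
        members = [
            short_id(m.get(RDF + "resource", ""))
            for m in elem.findall(BP + "memberPhysicalEntity")
        ]
        if members:
            ident = elem.get(RDF + "ID") or elem.get(RDF + "about") or ""
            generic[short_id(ident)] = members
    return generic


generic = find_generic_entities(ET.fromstring(SAMPLE))
print(generic)
```

On the real file, the root would come from `ET.parse(path).getroot()` instead of the inline sample. The resulting mapping only says which entities are generic; what each group actually means still has to be checked manually.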
Second, a mixture of:
- Physical entities defined by a unique structure, for instance proteins as defined by a UniProt entry
- Instantiated physical entities: the ones that carry a post-translational modification or are localized to a particular cellular compartment
- Fragments of physical entities, for instance alpha chains of different proteins
- Collections of fragments of physical entities, such as the collection of all the alpha chains found within the database
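One possible way to start untangling the first two categories is to group Protein instances by the reference they share, so that the "unique structure" and its instantiated forms become visible side by side. The sketch below assumes BioPAX Level 3 property names (entityReference, cellularLocation, feature) and rdf:resource links; the sample data and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed namespaces (BioPAX Level 3 and RDF); adjust if the export differs.
BP = "{http://www.biopax.org/release/biopax-level3.owl#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Hypothetical miniature export, only for illustration.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bp="http://www.biopax.org/release/biopax-level3.owl#">
  <bp:Protein rdf:ID="Protein1">
    <bp:entityReference rdf:resource="#ProteinReference1"/>
    <bp:cellularLocation rdf:resource="#CellularLocationVocabulary1"/>
  </bp:Protein>
  <bp:Protein rdf:ID="Protein2">
    <bp:entityReference rdf:resource="#ProteinReference1"/>
    <bp:cellularLocation rdf:resource="#CellularLocationVocabulary2"/>
    <bp:feature rdf:resource="#ModificationFeature1"/>
  </bp:Protein>
  <bp:Protein rdf:ID="Protein3"/>
</rdf:RDF>"""


def short_id(value):
    """Keep only the fragment after '#', so '#Protein1' becomes 'Protein1'."""
    return value.rsplit("#", 1)[-1]


def linked_id(elem, prop):
    """Return the id a child property points to via rdf:resource, or None."""
    child = elem.find(BP + prop)
    if child is None:
        return None
    return short_id(child.get(RDF + "resource", "")) or None


def forms_by_reference(root):
    """Group Protein instances under the entityReference they share."""
    forms = defaultdict(list)
    for prot in root.iter(BP + "Protein"):
        ident = short_id(prot.get(RDF + "ID") or prot.get(RDF + "about") or "")
        forms[linked_id(prot, "entityReference")].append(
            (ident, linked_id(prot, "cellularLocation"),
             len(prot.findall(BP + "feature")))
        )
    return dict(forms)


forms = forms_by_reference(ET.fromstring(SAMPLE))
for reference, instances in forms.items():
    print(reference, instances)
```

Entries grouped under `None` have no reference at all, which makes them candidates for the collection and fragment categories. This is a heuristic for triage, not a classification.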
Third, some redundancy in the use of OWL terms. For instance, Catalysis, Control, TemplateReactionRegulation and Degradation are all used in a similar fashion, even though Catalysis is a modulation of the kinetic barrier of a reaction while the others can actually perturb the reaction completely. What is the reason for this redundancy of terms? It is not very clear…
Fourth, a lack of information essential to a class in the class description, aka the “headless horseman” problem:
- Post-translational modifications on a protein, provided without a location (i.e. somewhere on the protein) and without a type (i.e. some modification).
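Such incomplete modifications can be listed automatically. The following is a rough sketch, assuming the modifications are exported as BioPAX Level 3 ModificationFeature elements with featureLocation and modificationType child properties; the sample data and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces (BioPAX Level 3 and RDF); adjust if the export differs.
BP = "{http://www.biopax.org/release/biopax-level3.owl#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Properties a usable modification is expected to carry.
REQUIRED = ("featureLocation", "modificationType")

# Hypothetical miniature export, only for illustration.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bp="http://www.biopax.org/release/biopax-level3.owl#">
  <bp:ModificationFeature rdf:ID="ModificationFeature1">
    <bp:featureLocation rdf:resource="#SequenceSite1"/>
    <bp:modificationType rdf:resource="#SequenceModificationVocabulary1"/>
  </bp:ModificationFeature>
  <bp:ModificationFeature rdf:ID="ModificationFeature2">
    <bp:modificationType rdf:resource="#SequenceModificationVocabulary1"/>
  </bp:ModificationFeature>
  <bp:ModificationFeature rdf:ID="ModificationFeature3"/>
</rdf:RDF>"""


def incomplete_modifications(root):
    """Map each ModificationFeature id to the required properties it lacks."""
    report = {}
    for feat in root.iter(BP + "ModificationFeature"):
        missing = [p for p in REQUIRED if feat.find(BP + p) is None]
        if missing:
            ident = feat.get(RDF + "ID") or feat.get(RDF + "about") or ""
            report[ident.rsplit("#", 1)[-1]] = missing
    return report


report = incomplete_modifications(ET.fromstring(SAMPLE))
for ident, missing in report.items():
    print(ident, "lacks", ", ".join(missing))
```

Note that this only checks whether the property is present. A featureLocation that points to a site with no actual position would still pass and needs a separate check.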
Last but not least, lots of “floating” compounds. There are about 6000 compounds (Proteins, Complexes or PhysicalEntities) that are referenced by exactly one reaction. This means they participate in no other reaction and regulate no other reaction, which seems quite unrealistic to me.
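A count like this can be reproduced by recording, for each entity, the set of interactions that point to it. The sketch below is one way to do it, assuming BioPAX Level 3 property names and rdf:resource links. The list of participant properties is my own choice, not something taken from Reactome, and a Catalysis or Control is counted as a separate referrer from the reaction it controls.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed namespaces (BioPAX Level 3 and RDF); adjust if the export differs.
BP = "{http://www.biopax.org/release/biopax-level3.owl#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Properties assumed to link an interaction to the entities it involves.
PARTICIPANT_PROPS = ("left", "right", "controller",
                     "participant", "product", "template")

# Hypothetical miniature export, only for illustration.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bp="http://www.biopax.org/release/biopax-level3.owl#">
  <bp:BiochemicalReaction rdf:ID="Reaction1">
    <bp:left rdf:resource="#Protein1"/>
    <bp:right rdf:resource="#Protein2"/>
  </bp:BiochemicalReaction>
  <bp:BiochemicalReaction rdf:ID="Reaction2">
    <bp:left rdf:resource="#Protein2"/>
    <bp:right rdf:resource="#Protein3"/>
  </bp:BiochemicalReaction>
  <bp:Catalysis rdf:ID="Catalysis1">
    <bp:controller rdf:resource="#Protein3"/>
    <bp:controlled rdf:resource="#Reaction2"/>
  </bp:Catalysis>
</rdf:RDF>"""


def short_id(value):
    """Keep only the fragment after '#', so '#Protein1' becomes 'Protein1'."""
    return value.rsplit("#", 1)[-1]


def find_floating_entities(root):
    """Return the ids of entities referenced by exactly one interaction."""
    used_by = defaultdict(set)
    for elem in root:  # top-level individuals under rdf:RDF
        ident = short_id(elem.get(RDF + "ID") or elem.get(RDF + "about") or "")
        for prop in PARTICIPANT_PROPS:
            for link in elem.findall(BP + prop):
                target = link.get(RDF + "resource")
                if target:
                    used_by[short_id(target)].add(ident)
    return sorted(e for e, users in used_by.items() if len(users) == 1)


floating = find_floating_entities(ET.fromstring(SAMPLE))
print(floating)
```

Entities that are never referenced at all do not appear in the result, and components of complexes are not followed, so the number this produces on the real file may differ from a count made another way.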
I have now spent about two weeks getting it all in order, and I’ll try to publish the resulting cleaned file once I am done.