Using standardized bioinformatics formats in Taverna workflows for integrating biological data
The field of bioinformatics is replete with standard formats for representing biological data. The standards that have been successfully adopted by the bioinformatics community are backed by software tools that can analyse, integrate and visualize data complying with those community-accepted formats. For example, the Systems Biology Markup Language (SBML) is very popular for representing biological systems because over 200 software packages support the standard (Hucka and Le Novère, 2010).
The existence of standards-compliant software provides an incentive for representing biological information in standard data formats. The integration and transformation of data into these formats can be achieved using programming libraries. For example, libSBML provides an application programming interface (API) for creating and manipulating biological networks represented in SBML (Bornstein et al., 2008).
There are also file formats for representing the sequence data held in the EMBL and UniProt databases, and related data can be integrated into sequences stored in these formats. Genome variation (SNP) data, for example, can be incorporated into a genomic sequence that is represented as an EMBL file. Programming libraries such as BioJava provide operations for integrating such information into these data formats (Holland et al., 2008). Once this is done, it's possible to use tools that understand the file formats to visualize and analyse the combined sequence data.
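To make the SNP example concrete, here is a minimal sketch in plain Python of adding variation features to the feature table of an EMBL-style record. It is not BioJava code and is not taken from the workflow described in this article: the function names, the toy record and its identifier TOY00001 are all hypothetical, and the sketch assumes the record already contains at least one FT line.

```python
def variation_feature_lines(position, alt_base):
    """Format one SNP as two EMBL-style feature-table (FT) lines.

    Assumes the usual layout: feature key starting in column 6 and the
    location or qualifier starting in column 22.
    """
    key_line = "FT   " + "variation".ljust(16) + str(position)
    qualifier_line = "FT" + " " * 19 + f'/replace="{alt_base}"'
    return [key_line, qualifier_line]


def add_snps_to_embl(record_text, snps):
    """Return record_text with one variation feature per (position, base) pair.

    New features are placed directly after the last existing FT line.
    """
    lines = record_text.splitlines()
    ft_indexes = [i for i, line in enumerate(lines) if line.startswith("FT")]
    if not ft_indexes:
        raise ValueError("record has no feature table to extend")
    insert_at = ft_indexes[-1] + 1
    new_lines = []
    for position, alt_base in sorted(snps):
        new_lines.extend(variation_feature_lines(position, alt_base))
    return "\n".join(lines[:insert_at] + new_lines + lines[insert_at:])


# Hypothetical toy record, for illustration only (not a real database entry).
TOY_RECORD = "\n".join([
    "ID   TOY00001; SV 1; linear; genomic DNA; STD; HUM; 12 BP.",
    "XX",
    "FH   Key             Location/Qualifiers",
    "FH",
    "FT   source          1..12",
    'FT                   /organism="Homo sapiens"',
    "XX",
    "SQ   Sequence 12 BP; 3 A; 3 C; 3 G; 3 T; 0 other;",
    "     acgtacgtac gt                                                            12",
    "//",
])

if __name__ == "__main__":
    # Two made-up SNPs: position 3 replaced by g, position 7 replaced by t.
    print(add_snps_to_embl(TOY_RECORD, [(7, "t"), (3, "g")]))
```

A library such as BioJava would do this on a parsed sequence object rather than on raw text, which avoids having to reason about column positions at all; the text-level version is used here only because it needs no dependencies.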
The problem is how to make all of these programming libraries interoperate with each other. The obvious way is to write an application that implements a pipeline for integrating and converting raw data into a standard format, so that the data can be further processed and visualized. An alternative is to use a workflow system, which provides a visual programming environment for building pipelines graphically. One such workflow system is Taverna, which enables operations from different programming libraries to be combined either directly, using its API consumer application, or indirectly, via web services (Oinn et al., 2004; Li et al., 2008). The workflow shown below maps feature information onto a protein sequence represented as a UniProt record and then visualizes the amalgamated data using a third-party piece of software:
Sequence features in the form of functional domains are identified within a protein using a web service tool called InterProScan. The results from InterProScan specify the type of functional domain identified and the range of positions that the motif occupies in the protein sequence. These data are merged into the UniProt sequence record by the addInterproFeatures step in the workflow, which uses BioJava code exposed as a web service. The functional motifs identified by InterProScan can then be viewed, from within the Taverna workbench, in relation to the whole protein using the sequence browsing tool SeqVISTA (Hu et al., 2003), which understands sequence data in common file formats such as UniProt and GenBank:
The sequence above shows a fibronectin domain (fn3), identified by InterProScan using HMMPfam, highlighted in a tyrosine-kinase receptor called TIE1.
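The merge-then-view idea behind the workflow can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the BioJava code behind addInterproFeatures and not SeqVISTA's rendering: the DomainHit and ProteinRecord classes, the function names, the toy sequence and the domain coordinates are all hypothetical, and ranges are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DomainHit:
    """One domain match: its name, the method that found it, and its range."""
    name: str
    method: str
    start: int  # 1-based, inclusive (assumption)
    end: int    # 1-based, inclusive (assumption)


@dataclass
class ProteinRecord:
    """A protein sequence plus the features that have been merged into it."""
    accession: str
    sequence: str
    features: list = field(default_factory=list)


def add_domain_features(record, hits):
    """Attach hits to the record, rejecting ranges outside the sequence."""
    for hit in hits:
        if not (1 <= hit.start <= hit.end <= len(record.sequence)):
            raise ValueError(
                f"{hit.name} range {hit.start}..{hit.end} is outside the sequence"
            )
        if hit not in record.features:  # skip exact duplicates
            record.features.append(hit)
    record.features.sort(key=lambda h: (h.start, h.end))
    return record


def highlight(record, hit):
    """Return the sequence with the hit's residues in upper case, the rest lower."""
    seq = record.sequence.lower()
    return seq[:hit.start - 1] + seq[hit.start - 1:hit.end].upper() + seq[hit.end:]


if __name__ == "__main__":
    # Toy data only: not the real TIE1 sequence or real fn3 coordinates.
    record = ProteinRecord("TOY_PROTEIN", "MKLVAGGTPWE")
    fn3 = DomainHit("fn3", "HMMPfam", 4, 8)
    add_domain_features(record, [fn3])
    print(highlight(record, fn3))
```

Checking ranges at merge time means that a mismatch between the scanned sequence and the stored record is reported when the features are added, rather than showing up later as a misplaced highlight in the viewer.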
Bornstein BJ, Keating SM, Jouraku A, Hucka M. (2008) LibSBML: an API library for SBML. Bioinformatics. 24(6):880-1.
Holland RC, Down TA, Pocock M, Prlić A, Huen D, James K, Foisy S, Dräger A, Yates A, Heuer M, Schreiber MJ. (2008) BioJava: an open-source framework for bioinformatics. Bioinformatics. 24(18):2096-7.
Hu Z, Frith M, Niu T, Weng Z. (2003) SeqVISTA: a graphical tool for sequence feature visualization and comparison. BMC Bioinformatics. 4:1.
Hucka M, Le Novère N. (2010) Software that goes with the flow in systems biology. BMC Biol. 8:140.
Li P, Oinn T, Soiland S, Kell DB. (2008) Automated manipulation of systems biology models using libSBML within Taverna workflows. Bioinformatics. 24(2):287-9.
Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, Li P. (2004) Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 20(17):3045-54.