on June 21, 2011 by Jochen Weile in Data integration, Tutorials

The Ondex data integration back-end — Episode 1: The Ondex philosophy

Introduction

Ondex (10.1093/bioinformatics/btl081) is a graph-based semantic data integration framework that specializes in the integration of biological data. The Ondex project was originally started by Jacob Koehler and his team at Rothamsted Research in 2004. It has since evolved into a large collaborative project between Rothamsted Research, Manchester University and Newcastle University.

The Ondex system has two major components: the data integration back-end and the visualisation front-end. This document is intended to be the first in a series to introduce users to the Ondex back-end.

In this first episode we will discuss the way in which Ondex represents knowledge and examine the back-end’s mechanism for performing data integration tasks. Building upon these basic notions, the next episode will explain how to use the Ondex back-end and how to perform large-scale data integration tasks.

Knowledge representation

Graphs, concepts and relations

Ondex stores biological knowledge by representing it as an enriched network. This network primarily consists of graph nodes, called ‘concepts’, and edges that connect them, called ‘relations’. Every concept and every relation has a type, called a ‘concept class’ for concepts and a ‘relation type’ for relations.
For example, imagine a concept of type ‘Gene’ being connected via a relation of type ‘encodes’ to another concept of type ‘Protein’. These concept classes and relation types are part of a controlled set of terms.
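
The typed network described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Ondex API: the class names `Concept` and `Relation`, their fields, and the two tiny vocabularies are assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabularies; the real Ondex metadata is far larger.
CONCEPT_CLASSES = {"Gene", "Protein"}
RELATION_TYPES = {"encodes"}


@dataclass(frozen=True)
class Concept:
    id: int
    concept_class: str

    def __post_init__(self):
        # Reject types that are not part of the controlled set of terms.
        if self.concept_class not in CONCEPT_CLASSES:
            raise ValueError(f"unknown concept class: {self.concept_class}")


@dataclass(frozen=True)
class Relation:
    source: Concept
    target: Concept
    relation_type: str

    def __post_init__(self):
        if self.relation_type not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.relation_type}")


# The Gene -encodes-> Protein example from the text.
gene = Concept(1, "Gene")
protein = Concept(2, "Protein")
encodes = Relation(gene, protein, "encodes")
```

Checking types at construction time is one simple way to keep every node and edge inside the controlled vocabulary.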

Provenance and evidence

Ondex features additional controlled sets of terms for provenance and evidence information, called ‘data sources’ and ‘evidence types’.

‘Data sources’ describe the origin of a data point. Every concept must be associated with a ‘data source’. For example, a ‘Protein’ concept could have been imported from UniProtKB (10.1093/nar/gkq1020), which would be indicated by assigning the ‘UniprotKB’ data source element to the concept.

‘Evidence types’, on the other hand, describe the evidence that supports the information represented by a concept or relation. Thus, all concepts and relations carry at least one evidence type reference. For example, if a relation has been determined using an accession mapping method, it would have the ‘accession_based_mapping’ evidence type attached.
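
As a rough sketch of these two rules (one data source per concept, at least one evidence type), consider the following Python fragment. The class `AnnotatedConcept` and the evidence type name ‘imported_from_database’ are hypothetical and chosen only for illustration; ‘UniprotKB’ and ‘accession_based_mapping’ are the terms used in the text.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabularies for provenance and evidence.
DATA_SOURCES = {"UniprotKB", "KEGG"}
EVIDENCE_TYPES = {"imported_from_database", "accession_based_mapping"}


@dataclass
class AnnotatedConcept:
    concept_class: str
    data_source: str
    evidence: frozenset

    def __post_init__(self):
        # Every concept must name a known data source.
        if self.data_source not in DATA_SOURCES:
            raise ValueError(f"unknown data source: {self.data_source}")
        # At least one evidence type is required, all from the controlled set.
        self.evidence = frozenset(self.evidence)
        if not self.evidence:
            raise ValueError("at least one evidence type is required")
        unknown = self.evidence - EVIDENCE_TYPES
        if unknown:
            raise ValueError(f"unknown evidence types: {sorted(unknown)}")


protein = AnnotatedConcept("Protein", "UniprotKB", {"imported_from_database"})
```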

Attributes

Concepts and relations can carry further attributes. Generic attributes allow the attachment of any kind of data (e.g. numbers, images, etc.) to concepts or relations. All attributes are identified by ‘attribute names’, which are again part of a controlled set. These attribute names not only provide a context for an attribute, but also determine the data type of its value. For example, a relation ‘has_similar_sequence’ between two protein concepts might carry an attribute with a value of 0.00001 under the attribute name ‘e-value’, which fixes its data type as a floating point number.
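
The way an attribute name fixes the data type of its value could look roughly like this. The registry `ATTRIBUTE_NAMES`, the helper `set_attribute` and the attribute name ‘taxonomy_id’ are assumptions for this sketch and do not reflect the actual Ondex classes.

```python
# Hypothetical registry: each attribute name determines the type of its value.
ATTRIBUTE_NAMES = {
    "e-value": float,
    "taxonomy_id": str,
}


def set_attribute(attributes, name, value):
    """Store value under name, enforcing the registered data type."""
    expected = ATTRIBUTE_NAMES.get(name)
    if expected is None:
        raise KeyError(f"unknown attribute name: {name}")
    if not isinstance(value, expected):
        raise TypeError(f"{name} expects {expected.__name__}, "
                        f"got {type(value).__name__}")
    attributes[name] = value


# Attributes of a 'has_similar_sequence' relation, as in the example above.
similarity_attributes = {}
set_attribute(similarity_attributes, "e-value", 0.00001)
```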

In addition to generic attributes, there are also certain special attributes for Ondex concepts. Cross references to external databases are stored in so-called ‘concept accession’ objects. These cross references usually point to associated database entries. This may be the entry in the database from which the concept was originally imported or any other related entries from other databases. Concept accessions consist of two parts: a ‘Data source’ object specifying the database plus the actual entry identifier.
Our example protein concept from UniProtKB could carry a concept accession pointing to its equivalent UniProtKB entry.

Another special attribute is the ‘concept name’. A concept can have more than one ‘concept name’, allowing synonyms to be added. A boolean flag can be used to make a ‘concept name’ a preferred name. This is mainly used for visualisation purposes in Ondex’s front-end. The example protein concept could have the preferred name ‘Cdc13p’ and the synonym ‘YDL220C’.
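
Concept accessions and concept names can be pictured as small value objects attached to a concept. The sketch below is an assumption-laden illustration: the class names, the `display_name` helper and the accession string are invented for this example (the accession is a placeholder, not a real UniProtKB identifier).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConceptAccession:
    data_source: str   # which database the identifier belongs to
    accession: str     # the entry identifier within that database


@dataclass(frozen=True)
class ConceptName:
    name: str
    preferred: bool = False  # flag marking the name used for display


@dataclass
class NamedConcept:
    concept_class: str
    accessions: list = field(default_factory=list)
    names: list = field(default_factory=list)

    def display_name(self):
        """Return the preferred name, falling back to the first name."""
        for concept_name in self.names:
            if concept_name.preferred:
                return concept_name.name
        return self.names[0].name if self.names else None


cdc13 = NamedConcept(
    "Protein",
    accessions=[ConceptAccession("UniprotKB", "PLACEHOLDER_ACCESSION")],
    names=[ConceptName("YDL220C"), ConceptName("Cdc13p", preferred=True)],
)
```

A front-end would then label the node with `cdc13.display_name()`, while the synonym remains available for searching.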

The Ondex network data structure

Illustration of the Ondex network data structure. A) Two concepts with various attributes connected by a relation. B) Part of a concept class hierarchy.

Metadata

As mentioned above, concept classes, relation types, data sources, evidence types and attribute names are all part of controlled vocabularies. These vocabularies are referred to as the Ondex graph’s metadata. Concept classes and relation types play a special role among the metadata. They are not simple collections of terms, but form hierarchies and can thus add more semantic depth to their members. For example, the concept class ‘Protein’ can be defined as a specialisation of the concept class ‘Molecule’.
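
A concept class hierarchy of this kind can be queried with a simple walk up the parent links. The following is a minimal sketch; the `PARENT` table and the ‘Enzyme’ class are assumptions for illustration, and only the Protein/Molecule pair comes from the text.

```python
# Each concept class maps to the class it specialises (hypothetical subset).
PARENT = {
    "Protein": "Molecule",
    "Enzyme": "Protein",
}


def is_a(concept_class, ancestor):
    """Return True if concept_class equals ancestor or specialises it."""
    current = concept_class
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)  # None once the root is reached
    return False
```

With such a check, a query for all ‘Molecule’ concepts would also match concepts typed as ‘Protein’.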

Graph implementations

Ondex allows for several different ways in which knowledge networks are stored. The most commonly used Ondex network implementation is the ‘Memory’ implementation. This is the simplest implementation of an Ondex graph. The complete data structure is held in the computer’s RAM. It works reasonably well for data sets of up to approximately 500,000 concepts and relations.

Alternatively, Ondex also offers a persistent implementation. This implementation holds the Ondex network inside a BerkeleyDB (http://portal.acm.org/citation.cfm?id=1268708.1268751) database. To minimize memory consumption, objects are held in RAM only temporarily while they are being accessed. Any changes made to objects are written back to the database. This implementation allows the construction of much bigger graphs, but causes a lot of I/O traffic, which makes it very slow.
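
The trade-off between the two implementations can be illustrated with two interchangeable stores. This is a sketch under stated assumptions: the class names and the `put`/`get` interface are hypothetical, and SQLite from Python's standard library stands in for BerkeleyDB purely to keep the example self-contained.

```python
import json
import sqlite3


class MemoryStore:
    """Keeps every concept in RAM: fast, but limited by available memory."""

    def __init__(self):
        self._concepts = {}

    def put(self, concept_id, concept):
        self._concepts[concept_id] = concept

    def get(self, concept_id):
        return self._concepts[concept_id]


class PersistentStore:
    """Keeps concepts in a database and loads them only when accessed."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS concepts "
            "(id INTEGER PRIMARY KEY, body TEXT)"
        )

    def put(self, concept_id, concept):
        # Every change is written back to the database (extra I/O).
        self._db.execute(
            "INSERT OR REPLACE INTO concepts VALUES (?, ?)",
            (concept_id, json.dumps(concept)),
        )
        self._db.commit()

    def get(self, concept_id):
        row = self._db.execute(
            "SELECT body FROM concepts WHERE id = ?", (concept_id,)
        ).fetchone()
        if row is None:
            raise KeyError(concept_id)
        return json.loads(row[0])  # a fresh in-memory copy


stores = [MemoryStore(), PersistentStore()]
for store in stores:
    store.put(1, {"class": "Protein", "source": "UniprotKB"})
```

Because both classes expose the same two methods, code that works on the graph does not need to know which store is in use; only the speed and the size limit differ.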

New, experimental implementations are currently under development. One of them keeps the essential graph structure in memory while storing all secondary data in an SQL database, thus combining the advantages of the memory and persistent implementations. Another is the RDF graph, which will keep the data in a triplestore.

Workflows

We have discussed how Ondex represents and stores data as networks. This leads to the question of how Ondex handles the creation and manipulation of these networks. Ondex follows a workflow paradigm. We define workflows as a series of operations that are performed on an Ondex network. Workflows usually begin with the creation of an empty graph object. Subsequently, ‘workflow components’ are invoked to populate and manipulate the graph. There are several different types of workflow components:

  • Parsers: A parser creates contents for an Ondex graph according to the information in a set of files. It usually targets a specific database or file format. For example, there is a parser for the KEGG (10.1093/nar/28.1.27) database, as well as a parser for SBML (10.1093/bioinformatics/btg015) files.
  • Mappings: A mapping creates relations between different parts of an Ondex graph. For example, the ‘accession-based’ mapping usually maps concepts from different data sources to one another. The ‘blast-based’ mapping creates relations between sequence-bearing concepts according to their similarity.
  • Filters: A filter extracts a subset from the graph. For example, the all-pairs shortest path filter yields all concepts and the subset of relations that are part of shortest paths between any two concepts.
  • Transformers: A transformer is a workflow component that transforms the graph from one configuration into another. For example, the relation collapse transformer merges concepts that are connected by a certain relation.
  • Statistics: This special kind of workflow component runs statistical analyses on the graph and writes the results to the file system. For example, the GOA quality statistic outputs a specificity analysis over GeneOntology (10.1038/75556) annotations.
  • Exporters: An exporter writes the contents of an Ondex graph to the file system. Most export operations are lossy, since not all file formats can represent all the data in Ondex. The only lossless export is to the Ondex Exchange Language (OXL) (10.2390/biecoll-jib-2007-62) format.

Workflow components can be bundled into plug-ins for the Ondex workflow engine. These plug-ins can then be distributed separately. An extensive list of plug-ins and the contained workflow components can be found on ondex.org.
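
To make the workflow paradigm more concrete, here is a minimal sketch of a workflow as a list of steps applied to an initially empty graph. Everything in it is an assumption for illustration: the dictionary-based graph, the function names, the ‘equivalent’ relation type and the source names ‘SourceA’ and ‘SourceB’ are invented and do not correspond to the Ondex workflow engine.

```python
def parse(graph, records, data_source):
    """Parser: add one concept per input record."""
    for record in records:
        graph["concepts"].append({
            "id": len(graph["concepts"]) + 1,
            "class": record["class"],
            "accession": record["accession"],
            "source": data_source,
        })
    return graph


def map_by_accession(graph):
    """Mapping: relate concepts from different sources sharing an accession."""
    concepts = graph["concepts"]
    for i, a in enumerate(concepts):
        for b in concepts[i + 1:]:
            same_accession = a["accession"] == b["accession"]
            if same_accession and a["source"] != b["source"]:
                graph["relations"].append({
                    "from": a["id"], "to": b["id"],
                    "type": "equivalent",  # hypothetical relation type
                    "evidence": "accession_based_mapping",
                })
    return graph


def filter_by_class(graph, concept_class):
    """Filter: keep concepts of one class and the relations among them."""
    keep = {c["id"] for c in graph["concepts"] if c["class"] == concept_class}
    return {
        "concepts": [c for c in graph["concepts"] if c["id"] in keep],
        "relations": [r for r in graph["relations"]
                      if r["from"] in keep and r["to"] in keep],
    }


def run_workflow(steps):
    """Start from an empty graph and apply each component in order."""
    graph = {"concepts": [], "relations": []}
    for step in steps:
        graph = step(graph)
    return graph


source_a = [{"class": "Protein", "accession": "A1"}]
source_b = [{"class": "Protein", "accession": "A1"},
            {"class": "Gene", "accession": "G7"}]

result = run_workflow([
    lambda g: parse(g, source_a, "SourceA"),
    lambda g: parse(g, source_b, "SourceB"),
    map_by_accession,
    lambda g: filter_by_class(g, "Protein"),
])
```

Treating every component as a function from graph to graph keeps the steps independent of one another, which is what allows them to be packaged and combined freely.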

Upcoming episodes

The next episode will explain how to use the Ondex back-end for large-scale data integration tasks. Subsequently, we will discuss topics relevant to Ondex developers, including an extensive description of Ondex’s graph and workflow APIs.
