Posted on July 5, 2011 by Mikel Egaña Aranguren in Data integration
Publishing information in the Linked Open Data cloud
Linked Data is a method for publishing structured data on the internet using current web technology. The core idea of Linked Data is to use the web as a giant database rather than as a collection of documents. This means that we can query the Linked Data network, integrating information from different resources, and build interesting applications that exploit such information. Linked Data is part of a broader endeavour, the Semantic Web, a vision for the next generation of the web in which information will be provided with precise semantics, allowing computers to perform automated reasoning on it and thus carry out “intelligent” tasks on their own.
The use of Linked Data is growing in bioinformatics, since it offers a lightweight solution to the perennial problem of data integration in the life sciences. This tutorial is a very brief overview of how to publish data following the Linked Data method.
What is Linked Data?
Two key features of Linked Data stand out: we publish our data in a structured format, and the data we publish can be linked to any other data on the web. These two ideas were also fundamental to the birth of the World Wide Web (WWW): we publish our information in a standard format, HTML web pages, and anyone can link web pages to other web pages (hyperlinks). The only difference between the WWW and Linked Data is that instead of applying these principles to textual documents, we apply them to raw data. Just as the network of web pages started to grow “organically”, the network of Linked Data is growing in the same fashion, with no central authority, as new agents link their data to the existing data. The worldwide network of open Linked Data is called the Linked Open Data (LOD) cloud.
In more technical terms, Linked Data follows four principles:
- Use URIs as names for things.
- Use HTTP URIs so that people can look up (Dereference) those names.
- When someone looks up a URI, provide useful information, using the standards.
- Include links to other URIs, so that they can discover more things.
The third principle states that useful information should be provided using the standards. The standards it refers to are RDF, SPARQL and OWL.
RDF (Resource Description Framework) is a language for providing structured information. The structure comes from codifying the information in triples that follow the subject-predicate-object pattern: the predicate is a property of the subject whose value is the object. A lot of information can be represented using this pattern, for example: Rome is part of Italy, CycB interacts with CycA, etc. Triples from different resources can be combined, since the subject of one triple can be the object of another triple. By combining different triples we obtain an RDF graph. An RDF graph can be queried using SPARQL. RDF entities (subjects, predicates, objects) are identified by URIs. Later, when served to the user in a Linked Data setting, the appropriate representation is provided by performing content negotiation: HTML, RDF/XML, JSON, etc. It is key to understand that the URI names the entity itself, and that the entity can have several representations: Linked Data is about linking entities, not their representations (documents).
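The triples mentioned above can be sketched in Turtle syntax (the URIs under example.org are placeholders, not real identifiers):

```turtle
@prefix ex:  <http://example.org/resource/> .
@prefix exo: <http://example.org/ontology/> .

ex:Rome  exo:is_part_of     ex:Italy .
ex:CycB  exo:interacts_with ex:CycA .
```

A SPARQL query over the resulting graph could then ask, for instance, for all interaction partners of CycB:

```sparql
PREFIX ex:  <http://example.org/resource/>
PREFIX exo: <http://example.org/ontology/>

SELECT ?partner WHERE {
  ex:CycB exo:interacts_with ?partner .
}
```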
Another key technology for linked data is the Web Ontology Language, OWL. OWL is used to build ontologies; an ontology is a formalisation, using logical axioms, of a knowledge domain. In the case of Linked Data, OWL is used to provide a vocabulary that will be used to publish the data. For example, if we are publishing a dataset that contains proteins and interactions between proteins, the concept “Protein” will be an OWL class. Such class will hold axiomatic definitions (General properties of all the proteins in this case) like Protein subClassOf is_codified_by only gene. Concrete proteins (e.g. CyCB) will be instances of such class (CyCb owl:Types Protein). Also, the properties used to link instances in pairs (eg. interacts_with) will be defined in the OWL ontology, including their features (e.g. transitivity, domain and ranges, etc.). OWL makes it possible to perform automated reasoning on the instances and the classes, for example to make information that was implicit in the data explicit to the user.
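In Turtle, the axioms described above might look as follows (a minimal sketch with placeholder example.org URIs; here interacts_with is declared symmetric, a natural characteristic for an interaction property):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix exo:  <http://example.org/ontology/> .
@prefix ex:   <http://example.org/resource/> .

# Protein subClassOf is_codified_by only Gene
exo:Protein a owl:Class ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty    exo:is_codified_by ;
        owl:allValuesFrom exo:Gene
    ] .

# Property definition, with its features and domain/range
exo:interacts_with a owl:ObjectProperty, owl:SymmetricProperty ;
    rdfs:domain exo:Protein ;
    rdfs:range  exo:Protein .

# A concrete protein as an instance of the class
ex:CycB a exo:Protein .
```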
OWL and RDF are part of the Semantic Web stack. In this stack, the technologies that sit in higher positions (OWL sits on top of RDF, and RDF sits on top of XML) extend the semantic expressivity of the layers below. For example, in rough terms, RDF adds the triple model to XML: an RDF file can be regarded as a plain XML file, or the RDF semantics can be taken into account. In the same way, OWL adds a whole range of expressivity on top of RDF: an OWL (RDF/XML) file can be parsed as OWL, or as a collection of RDF triples without taking the OWL semantics into account. This design ensures the gradual adoption of the Semantic Web, keeping the semantically most expressive layers syntactically compatible with the less expressive ones.
Publishing information as Linked Data
The general steps for publishing information in the LOD cloud are outlined below. These are very general guidelines, presented in a roughly (but not strictly) sequential order.
Considering what to publish
Consider, obviously, what you are going to publish. You should publish information that is absent from the LOD cloud; if redundant RDF triples are needed (e.g. for SPARQL querying), limit them to the minimum necessary. Such triples should also be flagged as equivalent to the information already present in the LOD cloud (see “Adding links to the RDF dataset” below).
Choosing a URI scheme
URIs are the basis for providing useful Linked Open Data, so you should think carefully about the URI scheme you will follow for your entities. It is usually a good idea to separate the ontology from the actual data instances; for example, the geo.linkeddata.es project follows this scheme:
- http://geo.linkeddata.es/ontology/ClassName (for Concepts).
- http://geo.linkeddata.es/ontology/property (for Properties).
- http://geo.linkeddata.es/resource/InstanceName (for data instances).
The dereferencing method (how to serve resources when the consumer of the information requests them via HTTP) should also be decided: 303 redirection or hash URIs.
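As an illustration, 303 redirection is often set up with web server rewrite rules. The following is a hedged sketch for Apache mod_rewrite (the /resource/, /data/ and /page/ paths are hypothetical choices, not a fixed convention): the entity URI answers with 303 See Other, redirecting to an RDF document or an HTML page depending on the Accept header.

```apache
RewriteEngine On

# RDF-aware clients are redirected to the RDF description...
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^/resource/(.*)$ /data/$1 [R=303,L]

# ...everyone else gets the HTML page about the entity.
RewriteRule ^/resource/(.*)$ /page/$1 [R=303,L]
```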
Building an ontology
As mentioned, the entities of RDF triples are published as instances of the classes of one or more ontologies. Such ontologies can be built from scratch, but it is preferable to reuse existing ontologies, to improve information integration. If we are unable to find an ontology that describes our domain, it is recommended to extend an existing ontology instead of building a new one. If that is not possible, an ontology can be built from scratch, for example using Protégé.
Converting original data sources into an RDF dataset
When publishing Linked Data, the most common situation is that the information is already stored in a relational database or in other formats such as spreadsheets. Thus, Linked Data acts as a “view” over the original information stored in other formats/systems. In order to convert such information to RDF, specific software is needed: D2R, D2RQ, R2O and ODEMapster, NOR2O, RightField, etc.
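The essence of such a conversion can be sketched in a few lines of Python (the column names, URI scheme and predicate are hypothetical; a real project would use one of the tools above rather than hand-rolled code):

```python
import csv
import io

# Placeholder namespaces, matching the URI scheme chosen earlier.
BASE = "http://example.org/resource/"
ONT = "http://example.org/ontology/"

def rows_to_ntriples(csv_text):
    """Turn each row of a protein-interaction spreadsheet into an N-Triples line."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        s = f"<{BASE}{row['protein']}>"
        p = f"<{ONT}interacts_with>"
        o = f"<{BASE}{row['partner']}>"
        triples.append(f"{s} {p} {o} .")
    return "\n".join(triples)

data = "protein,partner\nCycB,CycA\n"
print(rows_to_ntriples(data))
# <http://example.org/resource/CycB> <http://example.org/ontology/interacts_with> <http://example.org/resource/CycA> .
```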
Storing the RDF dataset
Once generated, the RDF dataset is usually stored in a triple store, a database designed for RDF (e.g. Virtuoso or 4store), which can then answer SPARQL queries over it.
Making the RDF dataset public
The dataset should be served using different interfaces:
- Pure RDF, in order for owners of other datasets to link to our dataset.
- HTML (e.g. generated with Pubby), so that humans can browse the dataset.
- SPARQL endpoint, in order for users or automatic agents (Applications) to perform queries against our dataset.
Additionally, the dataset should be easy to discover. This is done in two ways:
- The dataset should be uploaded to a catalog like CKAN. By uploading it to CKAN, our dataset will appear in the LOD cloud diagram (If our dataset meets the requirements).
- Search engines should be aware of updates to our datasets. This can be done by producing a set of sitemap.xml files from our SPARQL endpoint, using software like Sitemap4RDF. Services like Sindice should then be notified.
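Such a sitemap might look roughly like the following sketch, based on the Semantic Sitemaps extension (the dataset label, endpoint and dump locations are placeholders; check the current extension specification for the exact element names before use):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Example protein interaction dataset</sc:datasetLabel>
    <sc:sparqlEndpointLocation>http://example.org/sparql</sc:sparqlEndpointLocation>
    <sc:dataDumpLocation>http://example.org/dump.nt</sc:dataDumpLocation>
  </sc:dataset>
</urlset>
```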
Adding links to the RDF dataset
Adding links to other datasets is paramount for adding value to our own dataset. Such links can be of two kinds:
- Instance level links: subjects and objects are linked. Such links can use custom predicates already formalised in the ontology, like located_in (e.g. CycB located_in nucleus), or owl:sameAs axioms stating that two instances are in fact the same entity (e.g. MyDataset:CycB owl:sameAs UniProt:CycB). Instance level links can be set manually or by using a link discovery tool such as Silk.
- Vocabulary level links: OWL classes and properties are linked using different axioms: owl:equivalentClass, owl:disjointWith, rdfs:subClassOf, etc. Vocabulary level links are scarcer than instance level links.
It is also interesting to ask other dataset owners to link to our dataset, in order to maximise the connectivity of all the datasets.
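Both kinds of links can be expressed as ordinary triples. A minimal Turtle sketch (our namespaces are placeholders, and the UniProt accession is only illustrative):

```turtle
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix mydata:  <http://example.org/resource/> .
@prefix myont:   <http://example.org/ontology/> .
@prefix uniprot: <http://purl.uniprot.org/uniprot/> .
@prefix up:      <http://purl.uniprot.org/core/> .

# Instance level link: our CycB and a UniProt entry denote the same protein.
mydata:CycB owl:sameAs uniprot:P14635 .

# Vocabulary level link: our Protein class matches the UniProt core class.
myont:Protein owl:equivalentClass up:Protein .
```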
A bit of automated reasoning goes a long way
OWL automated reasoning does not scale well yet, but its performance is improving fast. With regard to Linked Data, automated reasoning is starting to be applied in different ways:
- Materialising triples. All the triples that are implicit in the data can be made explicit by the reasoner. Such “new” triples can be serialised (“written down”) before uploading the dataset, so that they can also be retrieved by SPARQL queries.
- Consistency checking. The reasoner can be used to check the consistency of the RDF data. Additionally, a tool such as Pellet ICV can be used to interpret OWL under the closed world assumption, i.e. to use the ontology as a constraint checker.
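The materialisation idea can be illustrated with a toy example: given a transitive property such as is_part_of, a reasoner repeatedly chains triples until nothing new can be inferred. The sketch below implements only that one inference rule in plain Python (real reasoners handle the full OWL semantics and are far more efficient):

```python
def materialise_transitive(triples, prop):
    """Forward-chain a transitive property until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for (s1, p1, o1) in list(inferred):
            for (s2, p2, o2) in list(inferred):
                # (s1 prop o1) and (o1 prop o2) entail (s1 prop o2).
                if p1 == p2 == prop and o1 == s2:
                    new = (s1, prop, o2)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred

explicit = {
    ("Trastevere", "is_part_of", "Rome"),
    ("Rome", "is_part_of", "Italy"),
}
closure = materialise_transitive(explicit, "is_part_of")
# The implicit triple (Trastevere, is_part_of, Italy) is now explicit
# and could be serialised alongside the original two.
```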