on June 21, 2011 by Jan Aerts in Bioinformatics APIs, Comments (2)

# Accessing Ensembl using Ruby – the Core database

This tutorial describes how to use the Ruby API to the Ensembl Core database. It is part of a collection of 3 tutorials on the Ruby API to Ensembl. Please see this introductory tutorial for more information on the Ruby API system, installation and a minimal script.

# Slices

A Slice object represents a single continuous region of a genome. Slices can be used to obtain sequence, features or other information from a particular region of interest. There are several ways to obtain a slice, but we will start with the Ensembl::Core::Slice#fetch_by_region method which is the most commonly used. This class method takes numerous arguments but most of them are optional. In order, the arguments are: coord_system_name, seq_region_name, start, end, strand, coord_system_version. The following are several examples of how to use the Ensembl::Core::Slice#fetch_by_region method:

Obtain a slice covering the entire chromosome X

[sourcecode language=”ruby”] slice = Slice.fetch_by_region(‘chromosome’, ‘X’)[/sourcecode]

Obtain a slice covering the entire clone AL359765.6

[sourcecode language=”ruby”] slice = Slice.fetch_by_region(‘clone’, ‘AL359765.6’)[/sourcecode]

Obtain a slice covering an entire NT contig

[sourcecode language=”ruby”] slice = Slice.fetch_by_region(‘supercontig’, ‘NT_011333’)[/sourcecode]

Obtain a slice covering the region from 1MB to 2MB (inclusively) of chromosome 20

[sourcecode language=”ruby”] slice = Slice.fetch_by_region(‘chromosome’, ’20’, 1000000, 2000000)[/sourcecode]

Another useful way to obtain a slice is with respect to a gene, e.g. with 5kb flanking sequence:

[sourcecode language=”ruby”] slice = Slice.fetch_by_gene_stable_id(‘ENSG00000099889’, 5000)[/sourcecode]

This will return a slice that contains the sequence of the gene specified by its stable Ensembl ID. It also returns 5000bp of flanking sequence at both the 5’ and 3’ ends, which is useful if you are interested in the environs that a gene inhabits. You needn’t have the flanking sequence it you don’t want it — in this case set the number of flanking bases to zero or simply omit the second argument entirely. Note that the fetch_by_gene_stable_id() method always returns a slice on the forward strand even if the gene is on the reverse strand.

To retrieve a set of slices from a particular coordinate system the fetch_all method can be used:

Retrieve slices of every chromosome in the database

[sourcecode language=”ruby”] slices = Slice.fetch_all(‘chromosome’)[/sourcecode]

Retrieve slices of every BAC clone in the database

[sourcecode language=”ruby”] slices = Slice.fetch_all(‘clone’)[/sourcecode]

For certain types of analysis it is necessary to break up regions into smaller manageable pieces. The method Slice#split can be used to break up larger slices into smaller component slices. The following code creates an array of subslices of chromosome 1, with the (maximal) length of each slice 100000 bp and an overlap of 250 bp.

[sourcecode language=”ruby”] big_slice = Slice.fetch_by_region(‘chromosome’, 1)
subslices = big_slice.split(100000, 250)[/sourcecode]

To obtain sequence from a slice the Slice#seq method can be used:

[sourcecode language=”ruby”] seq = slice.seq
puts seq[/sourcecode]

We can query the slice for information about itself:

[sourcecode language=”ruby”] seq_region = slice.seq_region.name
coord_system = slice.seq_region.coord_system.name
start = slice.start
stop = slice.stop
strand = slice.strand</p>
puts "Slice: #{coord_system} #{seq_region} #{start}-#{stop} (#{strand})"[/sourcecode]

Many classes can provide a set of features which overlap a slice. The slice itself also provides a means to obtain features which overlap its region. To obtain a list of genes which overlap a slice:

[sourcecode language=”ruby”] slice_a = Slice.fetch_by_region(‘chromosome’,’X’)
genes = slice_a.genes[/sourcecode]

CAUTION: The slice concept is a little bit different from that in the perl API! If you ask a gene for its slice using the perl API, you get a slice covering the whole of the chromosome. In contrast, the slice created by the ruby API only contains that bit covered by the gene. The Ruby slice concept therefore matches the feature_Slice concept in the Perl API. The Ensembl::Core::SeqRegion class is used to refer to whole things.

# Features

Features are objects in the database which have a defined location on the genome. All features in Ensembl include the Ensembl::Core::Sliceable mixin and have the following location defining attributes: start, end, strand, slice.

All feature objects can be retrieved using their #find method of their class or any of the generic #find_by_() methods (see the ActiveRecord bit of this tutorial). The following example illustrates how Transcript features and DnaDnaAlignFeature features can be obtained from the database. All features in the database can be retrieved in similar ways from their own object adaptors.

Get a slice of chromosome 20, 10MB-11MB

[sourcecode language=”ruby”] slice = Slice.fetch_by_region(‘chromosome’, ’20’, 10000000, 11000000 )[/sourcecode]

Fetch all of the transcripts overlapping chromosome 20, 10MB-11MB

[sourcecode language=”ruby”] transcripts = slice.transcripts
transcripts.each do |transcript|
name = transcript.stable_id
internal_id = transcript.id
start = transcript.start
stop = transcript.stop
strand = transcript.strand
puts "Transcript #{name} [#{internal_id}] #{start}-#{stop} (#{strand})"
end[/sourcecode]

Fetch all of the DNA-DNA alignments overlapping chromosome 20, 10MB-11MB

[sourcecode language=”ruby”] dafs = slice.dna_align_features
dafs.each do |daf|
name = daf.hit_name
internal_id = daf.id
start = daf.start
stop = daf.stop
strand = daf.strand
puts "DNA alignment #{name} [#{internal_id}] #{start}-#{stop} (#{strand})"
end[/sourcecode]

Fetch a transcript by its internal identifier

[sourcecode language=”ruby”] transcript = Transcript.find(100)[/sourcecode]

Fetch a DnaAlignFeature by its internal identifiers

[sourcecode language=”ruby”] daf = DnaAlignFeature.find(100)[/sourcecode]

All features also have the transform method which are described in detail in a later section of this tutorial.

## Features across coordinate systems

In the Ensembl database, some features might be related to one coordinate system, while other features are related to another one (for more information on coordinate systems, see below). For example, there are three coordinate systems in cow: contigs, scaffolds and chromosomes. Scaffold Chr4.003.122 does not have any simple_features on it. However, the equivalent regions in the contig and chromosome coordinate systems have 37 and 85 (=total of 122), respectively. If you therefore ask that scaffold to list its simple_features, you wouldn’t get any. A workaround for this, is to first create a slice for this scaffold, and ask that slice for its simple_features.

[sourcecode language=”ruby”] scaffold = SeqRegion.find_by_name(‘Chr4.003.122’)
puts scaffold.simple_features.length #–> 0
slice = Slice.fetch_by_region(‘scaffold’,’Chr4.003.122′)
puts slice.simple_features.length #–> 122[/sourcecode]

or even:

[sourcecode language=”ruby”] puts scaffold.slice.simple_features.length #–> 122[/sourcecode]

The reason this works, is that any retrieval for a slice also checks what coordinate systems that type of feature is annotated on.

## Genes, Transcripts, and Exons

Genes, exons and transcripts are also features and can be treated in the same way as any other feature within Ensembl. A transcript in Ensembl is a grouping of exons. A gene in Ensembl is a grouping of transcripts which share any overlapping (or partially overlapping) exons. Transcripts also have an associated Translation object which defines the UTR and CDS composition of the transcript. Introns are not defined explicitly in the database but can be obtained by the Ensembl::Core::Transcript#introns method.

Important: like all Ensembl features the start of an exon is always less than or equal to the end of the exon, regardless of the strand it is on. The start of the transcript is the start of the first exon of a transcript on the forward strand or the end of the last exon of a transcript on the reverse strand. The start and end of a gene are defined to be the lowest start value of its transcripts and the highest end value respectively.

Genes, translations, transcripts and exons all have stable identifiers. These are identifiers that are assigned to Ensembl’s predictions, and maintained in subsequent releases. For example, if a transcript (or a sufficiently similar transcript) is re-predicted in a future release then it will be assigned the same stable identifier as its predecessor.

The following is an example of the retrieval of a set of genes, transcripts and exons:

[sourcecode language=”ruby”] slice = Slice.fetch_by_region(‘chromosome’,’X’,1000000,10000000)
puts slice.display_name
slice.genes.each do |gene|
puts "\t" + gene.stable_id
gene.transcripts.each do |transcript|
puts "\t\t" + transcript.stable_id
transcript.exons.each do |exon|
puts "\t\t\t" + exon.id.to_s
end
end
end[/sourcecode]

In addition to the methods which are present on every feature, the transcript class has many other methods which are commonly used. Several methods can be used to obtain transcript related sequences. At the time of writing this tutorial, these methods return strings rather than bioruby Bio::Sequence objects. The following example demonstrates the use of some of these methods:

The Ensembl::Core::Transcript#seq method returns the concatenation of the exon sequences. This is the cDNA of the transcript:

[sourcecode language=”ruby”] puts "cDNA: " + transcript.seq[/sourcecode]

The Ensembl::Core::Transcript#cds_seq method returns only the CDS of the transcript

[sourcecode language=”ruby”] puts "CDS: " + transcript.cds_seq[/sourcecode]

UTR sequences are obtained via the five_prime_utr_seq and three_prime_utr_seq methods

[sourcecode language=”ruby”] fiv_utr = transcript.five_prime_utr_seq
thr_utr = transcript.three_prime_utr_seq</p>
puts "5′ UTR: " + ( fiv_utr.nil? ? ‘None’ : fiv_utr )
puts "3′ UTR: " + ( thr_utr.nil? ? ‘None’ : thr_utr )[/sourcecode]

The peptide sequence is obtained from the Ensembl::Core::Transcript#protein_seq method. If the transcript is non-coding, undef is returned.

[sourcecode language=”ruby”] peptide = transcript.protein_seq
puts "Translation: " + ( peptide.nil? ? ‘None’ : peptide )[/sourcecode]

## Translations and ProteinFeatures

Translation objects and peptide sequence can be extracted from a Transcript object. It is important to remember that some Ensembl transcripts are non-coding (pseudo-genes, ncRNAs, etc.) and have no translation. The primary purpose of a Translation object is to define the CDS and UTRs of its associated Transcript object. Peptide sequence is obtained directly from a Transcript object not a Translation object as might be expected. The following example obtains the peptide sequence of a Ensembl::Core::Transcript and the Ensembl::Core::Translation’s stable identifier:

[sourcecode language=”ruby”] stable_id = ‘ENST00000044768′</p>
transcript = Transcript.find_by_stable_id(stable_id)
puts transcript.stable_id
puts transcript.translation.stable_id[/sourcecode]

## PredictionTranscripts

PredictionTranscripts are the results of ab initio gene finding programs that are stored in Ensembl. Example programs include Genscan and SNAP. Prediction transcripts have the same interface as normal transcripts and thus they can be used in the same way.

[sourcecode language=”ruby”] prediction_transcripts = slice.prediction_transcripts
prediction_transcripts.each do |pt|
exons = pt.prediction_exons
type = pt.analysis.logic_name
puts "#{type} prediction has #{exons.length.to_s} exons"
exons.each do |exon|
puts exon.to_yaml
end
end[/sourcecode]

## Alignment Features

Two types of alignments are stored in the core Ensembl database: alignments of DNA sequence to the genome and alignments of peptide sequence to the genome. These can be retrieved as Ensembl::Core::DnaAlignFeatures and Ensembl::Core::ProteinAlignFeatures respectively. A single gapped alignment is represented by a single feature with a cigar line. A cigar line is a compact representation of a gapped alignment as single string containing letters M (match) D (deletion), and I (insertion) prefixed by integer lengths (the number may be omitted if it is 1). The following example shows the retrieval of some alignment features.

Retrieve dna-dna alignment features from the slice region

[sourcecode language=”ruby”] features = slice.dna_align_features(‘Vertrna’)
features.each do |f|
puts f.to_yaml
end[/sourcecode]

Retrieve protein-dna alignment features from the slice region

[sourcecode language=”ruby”] features = slice.protein_align_features(‘Swall’)
features.each do |f|
puts f.to_yaml
end[/sourcecode]

## Repeats

Repetitive regions found by RepeatMasker and TRF (Tandem Repeat Finder) are represented in the Ensembl database as RepeatFeatures. Short non-repetitive regions between repeats are found by the program Dust and are also stored as RepeatFeatures. RepeatFeatures can be retrieved and used in the same way as other Ensembl features.

[sourcecode language=”ruby”] repeats = slice.repeats
repeats.each do |r|
puts r.display_id + "\t" + repeat.start.to_s + "\t" + repeat.stop.to_s
end[/sourcecode]

## Markers

Markers are imported into the Ensembl database from UniSTS and several other sources. A marker in Ensembl consists of a pair of primer sequences, an expected product size and a set of associated identifiers known as synonyms. Markers are placed on the genome electronically using an analysis program such as ePCR and their genomic positions are retrievable as MarkerFeatures. Map locations (genetic, radiation hybrid and in situ hybridization) for markers obtained from actual experimental evidence are also accessible.

Markers can be fetched by their name. The Marker#find_all_by_name returns an array, and Marker#find_by_name returns the first element of that array, i.e. a marker object.

[sourcecode language=”ruby”] marker = Marker.find_by_name(‘D9S1038E’)[/sourcecode]

Display the various names associated with the same marker

[sourcecode language=”ruby”] marker.marker_synonyms.each do |ms|
if ms.source.nil?
puts ms.name
else
puts ms.source + ‘:’ + ms.name
end
end[/sourcecode]

Display the primer info

[sourcecode language=”ruby”] puts "left primer: " + marker.left_primer.to_s
puts "right primer: " + marker.right_primer.to_s
puts "product size: " + marker.min_primer_dist.to_s + ‘-‘ + marker.max_primer_dist.to_s[/sourcecode]

Display out genetic/RH/FISH map information

[sourcecode language=”ruby”] puts "Map locations:"
marker.marker_map_locations.each do |mapping|
puts mapping.map.map_name + "\t" + mapping.chromosome_name + "\t" + mapping.position.to_s
end[/sourcecode]

MarkerFeatures, which represent genomic positions of markers, can be retrieved and manipulated in the same way as other Ensembl features.

Obtain the positions for an already retrieved marker

[sourcecode language=”ruby”] marker.marker_features.each do |mf|
puts mf.slice.display_name
end[/sourcecode]

Retrieve all marker features in a given region

[sourcecode language=”ruby”] marker_features = slice.marker_features
marker_features.each do |mf|
puts mf.slice.display_name
end[/sourcecode]

## MiscFeatures

MiscFeatures are features with arbitrary attributes which are placed into arbitrary groupings. MiscFeatures can be retrieved as any other feature and are classified into distinct sets by a set code. Generally it only makes sense to retrieve all features which have a particular set code because very diverse types of MiscFeatures are stored in the database.

MiscFeature attributes are represented by Attribute objects and can be retrieved via a get_all_Attributes() method.

The following example retrieves all MiscFeatures representing ENCODE regions on a given slice and prints out their attributes:

[sourcecode language=”ruby”] encode_regions = slice.misc_features(‘encode’)
encode_regions.each do |er|
attributes = er.misc_attribs
attributes.each do |a|
puts a.to_s
end
end[/sourcecode]

This example retrieves all misc features representing a BAC clone via its name and prints out their location and other information:

[sourcecode language=”ruby”] clones = MiscFeature.find_all_by_attribute_type_value(‘name’, ‘RP11-62N12’)
clones.each do |clone|
slice = clone.slice
puts slice.to_yaml
attributes = clone.misc_attribs
attributes.each do |a|
puts a.to_s
end
end[/sourcecode]

# External References

Ensembl cross references its genes, transcripts and translations with identifiers from other databases. A cross reference is referenced by a Xref object. The following code snippet retrieves and prints Xrefs for a gene, its transcripts and its translations:

Get the ‘COG6’ gene from human

[sourcecode language=”ruby”] cog6 = Gene.find_by_name(‘COG6’)
puts ‘GENE: ‘ + cog6.stable_id + " (internal id: " + cog6.id.to_s + ")"
cog6.xrefs.each do |x|
puts x.to_s
end
cog6.transcripts.each do |t|
puts "TRANSCRIPT: " + t.stable_id
t.xrefs.each do |x|
puts "\s\s" + x.to_s
end

# Watch out: pseudogenes have no translation
if ! t.translation.nil?
translation = t.translation
puts "\tTRANSLATION: " + translation.stable_id
translation.xrefs.each do |x|
puts "\t\s\s" + x.to_s
end
end
end[/sourcecode]

Often it is useful to obtain all of the Xrefs associated with a gene and its associated transcripts and translation as in the above example. As a shortcut to calling #xrefs on all of the above objects the Gene#all_xrefs method can be used instead. The above example could be shortened by using the following:

[sourcecode language=”ruby”] cog6.all_xrefs.each do |x|
puts x.to_s
end[/sourcecode]

This returns all xrefs for the gene itself, including those for all transcripts and translations.

# Coordinates

We have already discussed the fact that slices and features have coordinates, but we have not defined exactly what these coordinates mean.

Ensembl, and many other bioinformatics applications, use inclusive coordinates which start at 1. The first nucleotide of a DNA sequence is 1 and the first amino acid of a peptide sequence is also 1. The length of a sequence is defined as end - start + 1.

In some rare cases inserts are specified with a start which is one greater than the end. For example a feature with a start of 10 and an end of 9 would be a zero length feature between base pairs 9 and 10.

Slice coordinates are relative to the start of the underlying DNA sequence region (a Ensembl::Core::SeqRegion object). The strand of the slice represents its orientation relative to the default orientation of the sequence region. By convention the start of the slice is always less than the end, and does not vary with its strandedness. Most slices you will encounter will have a strand of 1, and this is what we will consider in our examples. It is legal to create a slice which extends past the boundaries of a sequence region.

## Coordinate Systems, Sequence Regions and Slices

Sequences stored in Ensembl are associated with coordinate systems. What the coordinate systems are varies from species to species. For example, the homo_sapiens database has the following coordinate systems: contig, clone, supercontig, chromosome. Sequence and features may be retrieved from any coordinate system despite the fact they are only stored internally in a single coordinate system. The database stores the relationship between these coordinate systems and the API provides means to convert between them. The API has a Ensembl::Core::CoordSystem object and object adaptor, however, these are most often used internally. The following example fetches a chromosome coordinate system object from the database:

[sourcecode language=”ruby”] chr_coord_system = CoordSystem.find_by_name(‘chromosome’)
puts "Coordinate system: " + chr_coord_system.name + ":" + chr_coord_system.version[/sourcecode]

A coordinate system is uniquely defined by its name and version. Most coordinate systems do not have a version, and the ones that do have a default version, so it is usually sufficient to use only the name when requesting a coordinate system. For example, chromosome coordinate systems have a version which is the assembly that defined the construction of the coordinate system. The version of the human chromosome coordinate system might be something like NCBI35 or NCBI36, depending on the version of the Core databases used.

Ensembl::Core::SeqRegion objects have an associated Ensembl::Core::CoordSystem object and a #name method that returns its name which uniquely defines them. You may have noticed that the coordinate system of the sequence region was specified when obtaining a slice in the #fetch_by_region method. Similarly the version may also be specified (though it can almost always be omitted):

[sourcecode language=”ruby”] slice = Slice.fetch_by_region(‘chromosome’, ‘X’, 1000000, 10000000, ‘NCBI36’)[/sourcecode]

To obtain all sequence regions for a given coordinate system, just call the Ensembl::Core::CoordSystem#seq_regions method.

[sourcecode language=”ruby”] coord_system = CoordSystem.find_by_name(‘chromosome’)
chromomsomes = coord_system.seq_regions
chromosomes.each do |chr|
puts chr.name
end[/sourcecode]

Sometimes it is useful to obtain full slices of every sequence region in a given coordinate system; this may be done using the Slice#fetch_all method:

[sourcecode language=”ruby”] chromosomes = Slice.fetch_all(‘chromosome’)
clones = Slice.fetch_all(‘clone’)[/sourcecode]

Now suppose that you wish to write code which is independent of the species used. Not all species have the same coordinate systems; the available coordinate systems depends on the style of assembly used for that species (WGS, clone-based, etc.). You can obtain the list of available coordinate systems for a species using the Ensembl::Core::CoordSystem#find(:all) method and there is also a special pseudo-coordinate system named toplevel. The toplevel coordinate system is not a real coordinate system, but is used to refer to the highest level coordinate system in a given region. The toplevel coordinate system is particularly useful in genomes that are incompletely assembled. For example, the latest zebrafish genome consists of a set of assembled chromosomes, and a set of supercontigs that are not part of any chromosome. In this example, the toplevel coordinate system sometimes refers to the chromosome coordinate system and sometimes to the supercontig coordinate system depending on the region it is used in.

List all coordinate systems in this database:

[sourcecode language=”ruby”] coord_systems = CoordSystem.find(:all)
coord_systems.each do |coord_system|
puts coord_system.name + "\t" + coord_system.version
end[/sourcecode]

Get all slices on the highest coordinate system:

[sourcecode language=”ruby”] slices = Slice.fetch_all(‘top_level’)[/sourcecode]

## Transform

Features on a seq_region in a given coordinate system may be moved to another coordinate system. This is useful if you are working with a particular coordinate system but you are interested in obtaining the features coordinates in another coordinate system.

The Ensembl::Core::Sliceable#transform method (available to all features) can be used to move a feature to any coordinate system which is in the database. The feature will be a clone of the original feature, but with a different seq_region associated with it, as well as seq_region_start, seq_region_end and seq_region_strand.

[sourcecode language=”ruby”] #Suppose original_feature is on the ‘chromosome’ coordinate system
new_feature = original_feature.transform(‘clone’)
if new_feature.nil?
puts "Feature is not defined in clonal coordinate system"
else
puts "Feature’s clonal position:"
puts new_feature.seq_region.name
puts new_feature.seq_region_start.to_s + ".." + new_feature_seq_region_end
end[/sourcecode]

To print out the position of a feature (i.e. concatenating the seq_region name, start, end), it’s easier to create a slice of it first, and then calling the Ensembl::Core::Slice#display_name method:

[sourcecode language=”ruby”] puts new_feature.slice.display_name[/sourcecode]

The #transform method returns a copy of the original feature in the new coordinate system, or nil if the feature is not defined in that coordinate system. A feature is considered to be undefined in a coordinate system if it overlaps an undefined region or if it crosses a coordinate system boundary. Take for example the tiling path relationship between chromosome and contig coordinate systems:

                    |~~~~~~~| (Feature A) |~~~~| (Feature B)
(ctg 1) [=============]
(ctg 2) (------==========] (ctg 2)
(ctg 3)   (--============] (ctg3)

Both Feature A and Feature B are defined in the chromosomal coordinate system described by the tiling path of contigs. However, Feature A is not defined in the contig coordinate system because it spans both Contig 1 and Contig 2. Feature B, on the other hand, is still defined in the contig coordinate system.

The special toplevel coordinate system can also be used in this instance to move the feature to the highest possible coordinate system in a given region:

[sourcecode language=”ruby”] new_feature = original_feature.transform(‘toplevel’)
puts new_feature.slice.display_name[/sourcecode]

NOTE: In contrast to the perl API, there is no #transfer method.

## Project

When moving features between coordinate systems it is usually sufficient to use the Ensembl::Core::Sliceable#transform method. Sometimes, however, it is necessary to obtain coordinates in a another coordinate system even when a coordinate system boundary is crossed. Even though the feature is considered to be undefined in this case, the feature’s coordinates can still be obtained in the requested coordinate system using the Slice#project method.

While #transform is a method only available to features, both slices and features have their own #project methods, which take the same arguments and have the same return values. The #project method takes a coordinate system name as an argument and returns an array of Slice and Gap objects. The following example illustrates the use of the #project method on a slice. The #project method on a feature can be used in the same way. As with the feature #transform method the pseudo coordinate system toplevel can be used to indicate you wish to project to the highest possible level.

[sourcecode language=”ruby”] original_slice = Slice.fetch_by_region(‘chromosome’, ‘4’, 329500, 380000)
target_slices = @source_slice_contigs_with_strand.project(‘contig’)
target_slices.each do |ts|
puts ts.display_name
end[/sourcecode]

The above returns (for Bos taurus):

  contig::AAFC03092598:60948:61145:1
contig::AAFC03118261:25411:37082:1
contig::AAFC03092594:1:3622:-1
contig:gap:50
contig::AAFC03092597:820:35709:-1
contig::AAFC03032210:13347:13415:1