on June 21, 2011 by Tim Booth in Bioinformatics APIs, Tutorials

# Making use of JSON – a primer for small-time scripters

Here I’ll introduce the JSON data representation format, with a contrived situation where it might be useful and some examples of how to create and load JSON data in various languages. The article assumes you are familiar with a scripting language like Perl or Python and have access to a Linux or Unix shell with standard tools available.

## An example scenario

(Bio-)informaticians often find themselves needing to manipulate relatively small amounts of structured data. By relatively small, I mean that the computational time and memory requirements of processing the data are not problematic but manual manipulation of the results would be onerous.

Consider the following data, provided in Excel format. It represents the results of a toxicogenomic experiment which used targeted digital transcriptomics to analyse 120 samples. Specimens of two organisms were treated with various concentrations of a toxin. The primary data was analysed to produce counts of genes of interest and compiled into a single spreadsheet report – the records for just the first two samples are shown.

**Sample 1**: Species *Lumbricus rubellus*, Tissue gut, Toxin (µg/l) 0

| Gene | Count |
| --- | --- |
| AVPR1A | 128 |
| CLTCL1 | 229 |
| DAO | 1270 |
| FMO4 | 16 |
| GABRA4 | 43 |
| GRM2 | 310 |
| KCNJ6 | 211 |
| PIP5K2A | 15 |

**Sample 2**: Species *Lumbricus rubellus*, Tissue gut, Toxin (µg/l) 10

| Gene | Count |
| --- | --- |
| AVPR1A | 218 |
| CLTCL1 | 99 |
| DAO | 1240 |
| FMO4 | 26 |
| GABRA4 | 1043 |
| GRM2 | 215 |
| KCNJ6 | 217 |
| PIP5K2A | 31 |

Say I’d like to perform some further analysis of this data that turns out to be tricky to accomplish within Excel, or else would require manual repetition over every sample. Therefore I’ll convert the spreadsheet to CSV and write a script in my favourite scripting language to parse it and load the data into memory – for example in Perl:

[sourcecode language="perl"]
#!/usr/bin/perl
#parse_samples.perl - created Tue Jun 21 12:33:56 BST 2011
use strict;
use warnings;
use Data::Dumper;

my @samples;
my $sample_number;
my $listing_genes = 0;

for(<>)
{
    chomp;
    my @line = split ',', $_;
    next unless $line[0];

    #Split out sample number
    if($line[0] =~ /Sample (\d+)/)
    {
        $sample_number = $1 - 1;
        $listing_genes = 0;
        next;
    }
    if($line[0] eq 'Gene')
    {
        $listing_genes = 1;
        next;
    }

    #Add data into Perl data structure
    if(!$listing_genes)
    {
        #Trim anything after spaces
        $line[0] =~ s/ .*//;
        $samples[$sample_number]->{$line[0]} = $line[1];
    }
    else
    {
        $samples[$sample_number]->{Genes}->{$line[0]} = $line[1];
    }
}

#Quick data serialisation
print Dumper \@samples;
[/sourcecode]

The script above loads the data into a hierarchical Perl data structure (don’t worry if you don’t understand the details, but do take a look at the output). I can then go on to, say, calculate normalised transcript abundances, or invoke a web service to load Gene Ontology annotations for all the genes, etc. To see my intermediate results, Perl allows me to dump the array from memory using the Data::Dumper module.

[sourcecode language="perl"]
$VAR1 = [
          {
            'Species' => 'Lumbricus rubellus',
            'Genes' => {
                         'CLTCL1' => '229',
                         'AVPR1A' => '128',
                         'GRM2' => '310',
                         'KCNJ6' => '211',
                         'FMO4' => '16',
                         'DAO' => '1270',
                         'GABRA4' => '43',
                         'PIP5K2A' => '15'
                       },
            'Toxin' => '0',
            'Tissue' => 'gut'
          },
          {
            'Species' => 'Lumbricus rubellus',
            'Genes' => {
                         'CLTCL1' => '99',
                         'AVPR1A' => '218',
                         'GRM2' => '215',
                         'KCNJ6' => '217',
                         'FMO4' => '26',
                         'DAO' => '1240',
                         'GABRA4' => '1043',
                         'PIP5K2A' => '31'
                       },
            'Toxin' => '10',
            'Tissue' => 'gut'
          }
        ];
[/sourcecode]

The Dumper() output is fairly easy for a human to read, and can easily be reloaded into Perl, so I could now save the data in this format and load it back up again if I need to do further analysis. The problem is that only Perl is able to easily re-read this format. How can I save the results in a language-agnostic way without incurring extra hassle? Outputting the data as XML might be one option, but there is no standardised way of writing data structures like this to XML and in any case reading XML by eye is hard work. It turns out that the “JSON” module for Perl solves this problem with just a two-line change to the code.

## Making it re-usable

[sourcecode language="perl"]
#At the top of the script
use JSON;

# ...

#At the end, in place of the Data::Dumper line
print to_json(\@samples, {pretty=>1});
[/sourcecode]

And the result looks like this:

[sourcecode language="javascript"]
[
   {
      "Species" : "Lumbricus rubellus",
      "Genes" : {
         "CLTCL1" : "229",
         "AVPR1A" : "128",
         "GRM2" : "310",
         "KCNJ6" : "211",
         "FMO4" : "16",
         "DAO" : "1270",
         "GABRA4" : "43",
         "PIP5K2A" : "15"
      },
      "Toxin" : "0",
      "Tissue" : "gut"
   },
   {
      "Species" : "Lumbricus rubellus",
      "Genes" : {
         "CLTCL1" : "99",
         "AVPR1A" : "218",
         "GRM2" : "215",
         "KCNJ6" : "217",
         "FMO4" : "26",
         "DAO" : "1240",
         "GABRA4" : "1043",
         "PIP5K2A" : "31"
      },
      "Toxin" : "10",
      "Tissue" : "gut"
   }
]
[/sourcecode]

The JSON output is actually very similar to the native Perl output. As with the Perl output, I have no need to define a schema or modify the data in any way before exporting. JSON stands for “JavaScript Object Notation” but, despite its origins in JavaScript, it is supported by most modern programming languages. For example, somebody could load this data in Python:

[sourcecode language="python"]
#!/usr/bin/env python
import sys
import json

# Read the JSON file named on the command line and load it
# into native Python lists and dictionaries.
with open(sys.argv[1]) as json_file:
    data = json.load(json_file)

print(data)
[/sourcecode]

Or in R:

[sourcecode language="r"]
> install.packages("rjson")
> library(rjson)
> mydata <- fromJSON(file = "data.json")
[/sourcecode]

The mydata variable now contains the data ready to be manipulated in R.
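With the data loaded in any of these languages, the normalisation step mentioned earlier takes only a few lines. Below is a minimal Python sketch; the embedded record and the counts-per-thousand scaling are my own illustrative choices, not something prescribed by the original analysis (in practice the list would be read from the saved data.json file).

```python
import json

# A single sample record in the format produced by the Perl script;
# in practice this would be read from the saved data.json file.
sample_json = """
[ { "Species": "Lumbricus rubellus", "Tissue": "gut", "Toxin": "0",
    "Genes": { "AVPR1A": "128", "DAO": "1270", "FMO4": "16" } } ]
"""
samples = json.loads(sample_json)

# Scale each gene count to counts-per-thousand within its sample -
# an illustrative normalisation, not one taken from the experiment.
for sample in samples:
    total = sum(int(n) for n in sample["Genes"].values())
    for gene, count in sorted(sample["Genes"].items()):
        print("%s\t%.1f" % (gene, 1000.0 * int(count) / total))
```

Because JSON stores the counts as strings here, note the explicit `int()` conversions before doing arithmetic.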

There are also online tools that can load and format JSON structured data – e.g. www.jsonviewer.com. In fact, an original purpose of the JSON format was to enable dynamic web content, where a JSON data structure is formatted into a web page by embedded JavaScript. This approach is potentially neater than writing out static HTML, simpler than an XML+XSLT combination, and also provides more scope for dynamic interaction with the data on the page through JavaScript callbacks.
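When no online viewer is to hand, the same re-formatting can be done locally: `python -m json.tool < data.json` pretty-prints a file from the shell, and programmatically it is a single call to Python's standard json module. A small sketch, with a made-up compact string standing in for real data:

```python
import json

# A compact JSON string, as it might arrive from a script or web service
ugly = '{"Genes":{"DAO":"1270","FMO4":"16"},"Tissue":"gut"}'

# Round-trip through the parser to get an indented, key-sorted view
pretty = json.dumps(json.loads(ugly), indent=4, sort_keys=True)
print(pretty)
```

This round trip also doubles as a quick validity check: if the input is not well-formed JSON, `json.loads` raises an error rather than printing anything.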

Finally, JSON is fairly amenable to standard Unix tools like grep and sed. For example, I might want to get back to a spreadsheet format similar to the one I started with. In this case I can simply do this:

[sourcecode language="sh"]
tr -d '{}[],:' < data.json > data.csv
[/sourcecode]

The resulting data.csv can be imported into Excel or OpenOffice Calc by setting the field delimiter to space and selecting “merge delimiters” on the import dialogue. Thus I have an easy way to send the results of my processing to non-coders who prefer to work in a spreadsheet.
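For anything beyond this quick hack, though, it is safer to write real CSV with a proper library, since tr would also strip those characters from inside any field values that contain them. A sketch in Python, with one record embedded for illustration and column names of my own choosing:

```python
import csv
import io
import json

# One record in the format shown earlier; in practice the list would
# be loaded from data.json with json.load().
samples = json.loads('[{"Species": "Lumbricus rubellus", "Tissue": "gut",'
                     ' "Toxin": "10", "Genes": {"AVPR1A": "218", "DAO": "1240"}}]')

# Write proper CSV, with quoting handled by the csv module. Swap the
# StringIO for open("data.csv", "w", newline="") to produce a real file.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Sample", "Species", "Tissue", "Toxin (ug/l)", "Gene", "Count"])
for number, sample in enumerate(samples, start=1):
    for gene, count in sorted(sample["Genes"].items()):
        writer.writerow([number, sample["Species"], sample["Tissue"],
                         sample["Toxin"], gene, count])

print(out.getvalue())
```

One flat row per gene, with the sample metadata repeated, also happens to be a convenient shape for pivot tables in Excel or OpenOffice Calc.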