on June 21, 2011 by Keith Flanagan in Grid and Cloud Computing, Comments (2)

Microbase Responder Tutorial

Introduction

Microbase is a distributed computing platform that can be used to execute computationally intensive work in parallel, either on a single machine with multiple processors, or across a set of multiple network-connected machines.

This tutorial explains how to wrap an existing command-line tool using the Microbase responder architecture. Although intended as a simple example, it exercises most Microbase functionality. The resulting implementation may be executed inside a distributed computing environment, such as a Condor grid or a cloud provider such as Amazon EC2.

Motivation

Many bioinformatics tools are computationally intensive to run, or become computationally intensive when there are a large number of data files that need to be analysed. Large computational tasks are typically split into smaller units of work, termed ‘jobs’ for the remainder of this article. These smaller units may then be executed within different threads running on different CPUs, or even as processes distributed across a number of computers. Even with parallelisation, computational work may take many hours or days for large data sets.

Whilst large performance increases are possible with parallel computation, additional complexity arises from the distributed environment itself. Hardware problems such as disk or networking failures may result in a higher number of process crashes than would be expected if the work were performed on a single machine. Additionally, a variety of processor architectures, hardware capabilities and operating systems form a highly heterogeneous environment in which unforeseen compatibility problems might arise. Different versions of specialised analysis software may be present on different machines, or may not be installed at all. Finally, complexity is increased further if there is a requirement to execute multiple analysis tools, either independently or as a pipeline, where the output data from one tool must flow as input data into the next program.

Microbase

Microbase has been developed as a platform on which scalable and repeatable distributed pipelines can be constructed. The vast majority of programs can be executed within the Microbase system unmodified. This is important, since many applications are provided by their publishers as binary-only packages. The pipeline developer is responsible for constructing wrappers around applications that mask the presence of the distributed environment; existing applications have no concept of executing as part of a pipeline in a distributed environment. These wrappers are termed ‘responders’; they start computational work in response to a system event (these events are covered in more detail later). The system provides the following functionality to application developers:

  • Management of software installations. The system can install software temporarily on general-purpose machines, or utilise pre-installed software on dedicated servers, enabling a wide range of computing hardware to participate in distributed analyses.
  • A distributed filesystem provides a universal means of addressing data files. Microbase provides a Service Provider Interface (SPI) for reading, writing and querying for files. Several plugins are currently available, providing a number of transport mechanisms including FTP, Amazon S3 and BitTorrent, as well as access to local disk files.
  • A publish/subscribe notification system coordinates all tasks that occur within a Microbase system. Responders subscribe to particular message topics, and react to messages published by other responders if the topic of a message is relevant to them. Small amounts of arbitrary data can be attached to messages, allowing interprocess communication between loosely-coupled responders. For example, a responder might publish a message signifying that new data of some kind is available; this message might then be of interest to one or more subscribed responders, triggering different analyses of that data to take place. A conceptual sketch of such message content is shown after this list.
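As a conceptual illustration, the key → value content of such a ‘new data available’ message might look like the map built below. The keys match those used in the responder example later in this article; the values, class and method names are invented for illustration, and the publishing API itself is not shown.

[sourcecode language="java"]
import java.util.HashMap;
import java.util.Map;

public class MessageContentExample
{
    public static Map<String, String> describeNewDataFile()
    {
        // Key -> value pairs describing a newly available file in the
        // Microbase filesystem. The keys are reused by the responder example
        // later in this article; the values here are purely illustrative.
        Map<String, String> content = new HashMap<String, String>();
        content.put("file_bucket", "genomes");
        content.put("file_path", "/bacteria/e_coli");
        content.put("filename", "sequence.fasta");
        return content;
    }
}
[/sourcecode]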

In addition to the features visible to the responder developer, Microbase also performs various consistency checks, logging operations and maintenance of execution statistics about a running system. Keeping track of which jobs have been executed, which machine they executed on, and whether they have succeeded or failed is an essential housekeeping task in order to ensure a consistent and complete analysis. Computational jobs may fail for one of three reasons: a problem with the job implementation itself; a problem with the input data being in an unexpected or illegal format; a problem caused by the execution environment, such as an incorrectly copied file or hardware failure. If a computational job fails, any changes that were made to shared data structures must be undone, before the job is automatically re-executed. Jobs that consistently fail after numerous re-execution attempts are assumed to have an internal problem, or have bad input data, rather than failing as a result of transient problems with the distributed environment.
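To make the retry policy concrete, here is a purely illustrative sketch in plain Java. The Job interface and runWithRetries method are hypothetical and are not part of the Microbase API; Microbase performs this bookkeeping itself.

[sourcecode language="java"]
public class RetryPolicyExample
{
    /** Hypothetical unit of work; Microbase's real job abstraction differs. */
    public interface Job
    {
        void execute() throws Exception;  // perform the unit of work
        void rollback();                  // undo any changes made to shared data
    }

    private static final int MAX_ATTEMPTS = 3;

    public static boolean runWithRetries(Job job)
    {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++)
        {
            try
            {
                job.execute();
                return true;              // the job completed successfully
            }
            catch (Exception failure)
            {
                job.rollback();           // undo partial changes before re-executing
            }
        }
        // Consistent failure after several attempts: assume a problem with the
        // job itself or its input data rather than a transient environment fault.
        return false;
    }
}
[/sourcecode]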

Writing a Microbase job

Before writing a job, it is first essential to decide how the work can be most efficiently parallelised. The best-case scenario is where a large computational task can be split into multiple independent jobs – i.e., jobs that require little or no communication with shared resources, and can execute in isolation from one another. Such jobs are ‘embarrassingly parallel’, and typically achieve a speedup close to n times when executed on n CPUs.
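As an illustration of this idea in plain Java (not Microbase code), independent per-file jobs can be executed concurrently on a single multi-core machine using a standard thread pool. The analyseAll and analyse methods below are hypothetical placeholders for a real analysis.

[sourcecode language="java"]
import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class EmbarrassinglyParallelExample
{
    public static void analyseAll(List<File> inputFiles) throws InterruptedException
    {
        // Use one worker thread per available CPU core
        int cpus = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cpus);

        // Each file is an independent job: no shared state, no communication
        for (final File inputFile : inputFiles)
        {
            pool.submit(new Runnable()
            {
                public void run()
                {
                    analyse(inputFile);  // hypothetical per-file analysis
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void analyse(File inputFile)
    {
        // Placeholder for invoking a real analysis tool on the file
        System.out.println("Analysing " + inputFile.getName());
    }
}
[/sourcecode]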

The process of writing a compute job is similar in many respects to writing a standard command line application. In Java, a typical application will begin with a ‘main’ method as follows:

[sourcecode language="java"]
public class MyProgram
{
    public static void main(String[] args)
    {
        System.out.println("Hello World!");
    }
}
[/sourcecode]

Various application-specific command line arguments could be passed to the simple program above via the ‘args’ parameter. These arguments would then be processed by the application appropriately – some might be literal values, others might need to be parsed into a numerical type, while others might be paths to files stored on a local disk.
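For instance (a generic sketch, unrelated to the Microbase API, with made-up argument meanings), such arguments might be handled like this:

[sourcecode language="java"]
import java.io.File;

public class ArgsExample
{
    public static void main(String[] args)
    {
        // args[0]: a literal value, e.g. the name of an analysis
        String analysisName = args[0];

        // args[1]: a value that must be parsed into a numerical type
        double cutoff = Double.parseDouble(args[1]);

        // args[2]: a path to a file stored on the local disk
        File inputFile = new File(args[2]);

        System.out.println("Running " + analysisName + " with cutoff " + cutoff
                + " on " + inputFile.getAbsolutePath());
    }
}
[/sourcecode]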

In a Microbase responder, the equivalent “hello world” application would be as follows:

[sourcecode language="java"]
public class MyProgram
    extends MessageProcessor
{
    @Override
    public Set<Message> processMessage(Message msg)
        throws MessageProcessingException
    {
        System.out.println("Hello World!");

        // Nothing further to publish from this trivial responder (an empty
        // result set is assumed here to mean "no follow-on messages")
        return Collections.<Message>emptySet();
    }
}
[/sourcecode]

For this simple case, there is very little difference between the implementations. The Microbase responder has a different entry point to the program: processMessage() instead of main(). Instead of the ‘args’ variable used to pass runtime configuration parameters, there is a variable ‘msg’ that references the message that the program will need to process.

Messages contain data in the form of key → value pairs, and can be used to pass information among responders. For the purposes of this example, assume that the incoming message contains information about a data file stored in the Microbase filesystem. The following example shows how this file can be obtained.

[sourcecode language="java"]
public class MyProgram
    extends MessageProcessor
{
    @Override
    public Set<Message> processMessage(Message msg)
        throws MessageProcessingException
    {
        // Extract the location of the input file from the message content
        String fileBucket = msg.getContent().get("file_bucket");
        String filePath = msg.getContent().get("file_path");
        String fileName = msg.getContent().get("filename");

        FileInfo inputFileDesc =
            new FileInfo(fileBucket, filePath, fileName);

        InputStream inputStream =
            runtime.getMicrobaseFS(inputFileDesc);

        try
        {
            // Do something with the standard InputStream
            inputStream.close();
        }
        catch (IOException e)
        {
            // Report the I/O failure to Microbase (the (String, Throwable)
            // constructor used here is an assumption)
            throw new MessageProcessingException("Failed to read input file", e);
        }

        // Nothing further to publish in this example
        return Collections.<Message>emptySet();
    }
}
[/sourcecode]

Three parameters are sent via the notification message to the responder: ‘file_bucket’, ‘file_path’, and ‘filename’. These three parameters uniquely identify a file stored in the Microbase filesystem. These are first aggregated into a FileInfo object.

The ‘runtime’ instance is provided by the ‘MessageProcessor’ superclass, and provides a large amount of information about the running Microbase system, as well as information about the machine that the current responder instance is executing on. In this case, we simply need access to the Microbase filesystem. A standard Java InputStream can be obtained by passing a reference to the file descriptor that is required. Note that no implementation-specific information is needed. The Microbase platform identifies which machine the file is actually located on, and uses one of the configured transport plugins to transfer the file, before returning the stream to the caller.
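As a follow-up to the example above, the returned stream can be consumed like any other Java InputStream. The sketch below is a hypothetical helper (assuming the file is plain text) that simply counts the lines of the retrieved file:

[sourcecode language="java"]
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class StreamUsageExample
{
    /** Counts the lines of a plain-text file retrieved from the Microbase filesystem. */
    public static int countLines(InputStream inputStream) throws IOException
    {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(inputStream));
        int lines = 0;
        try
        {
            while (reader.readLine() != null)
            {
                lines++;
            }
        }
        finally
        {
            reader.close();  // also closes the underlying InputStream
        }
        return lines;
    }
}
[/sourcecode]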

This tutorial has introduced Microbase and provided a basic responder implementation. A virtual machine containing these examples will be made available at <<TODO URL>>.

2 Comments

  1. Jochen Weile

    June 22, 2011 @ 5:22 pm

    Review of Microbase Responder Tutorial

    This is a review of the Microbase Responder Tutorial. The tutorial gives a brief introduction to the motivation behind parallelization efforts in bioinformatics, and its difficulties. Subsequently, the general functionality of microbase is explained, including software installation management, a shared file system, a notification system and various housekeeping services. The article’s second part gives short examples on how to write simple components for Microbase.

    The text is generally well written and gives a short and concise overview of the topic, as well as easily understandable instructions for a user’s first steps with the Microbase system.

    A few minor issues:
    * It would be nice to see references to any Microbase papers for further reading.
    * Regarding the code examples:
      * It would be good to know which libraries need to be used, and where they are available.
      * It does not become clear whether the message keys (“file_bucket”, “file_path” and “filename”) are part of a controlled set of terms. How does the user know which keys to use?
    * The text contains a few minor typos and mistakes.

  2. Introducing the Bioinformatics Knowledgeblog | Knowledge Blog

    June 28, 2011 @ 3:36 pm

    […] these articles are tutorials, covering everything from large data integration suites like Ondex and cloud/grid computing infrastructure to metabolic modelling and some of the many facets of […]
