Paul Greenfield, CSIRO

The genomics research community is a heavy user of database technology, but its focus is largely on databases as simple repositories of genetic data. The conventional genomics processing pipeline stores genetic sequence data as long strings that are retrieved in their entirety from databases and searched externally using pattern-matching tools such as BLAST and customised Perl scripts. This approach is somewhat more sophisticated than the 'FTP and grep' model of processing astronomical data once described by Jim Gray, but it falls well short of being able to answer biological questions directly by running queries over suitably structured databases - and this is the eventual goal of the work I would like to discuss at HPTS 2009.

My current work on genomic databases and queries is focussed on bacteria, as they have smaller and simpler genomes, and large numbers of them have already been sequenced. Bacteria also have the advantage of having been diverging genomically for billions of years, so distantly related organisms share few DNA sequences.

One of my current projects compares the genomes of 700+ bacteria to each other in a single pass through a database of bacterial genomes. This program performs about 5 billion short-sequence look-up queries in about 14 hours on a quad-core workstation, an average of about 100,000 database queries/second. This level of performance is achieved through careful use of indices and synchronised look-up threads that make effective use of the database's read-ahead and buffering strategies. The result of this analysis is a very high-level view of how different species of bacteria are related, and it shows where current bacterial taxonomies may need to be revised. The same database can also answer queries about gene sharing between organisms, and can be used to shed light on the structure of bacterial communities ('metagenomics'). These bacterial databases and the applications that query them are highly partitionable and scalable, making them good candidates for implementation using map-reduce algorithms on large-scale clusters and clouds.

The work I would most like to discuss with the HPTS community is storing and querying large numbers of large, complex genomes. The cost of DNA sequencing is falling rapidly and promises to fall even faster over the next few years. There will certainly be thousands of complete human genomes available to researchers within a few years, and perhaps many more than that if the promises of the sequencing technology vendors can be believed. The challenge will be structuring and storing this volume of data (at least 6 giga-basepairs per genome) as something more than semi-structured sets of strings - something that will let researchers answer questions about differences between populations and about the relationship between genetic differences and diseases. This is a challenge very similar to the astronomical one addressed by Jim Gray and Alex Szalay.

We have also been continuing our work on consistency for loosely-coupled, service-based applications - the 'Promises' work that I discussed at the last HPTS and at CIDR in 2007. Most recently we have looked at what we call property-based promises and at how promises over abstract resources could be implemented effectively. If I am invited to HPTS again this year, I look forward to discussing with Pat Helland and others just what consistency means in a loosely-coupled world and what patterns and technologies could give 'good enough' consistency.
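
As a rough illustration of the kind of short-sequence look-up that drives the bacterial comparison described above, the sketch below builds an in-memory k-mer index and counts the k-mers one genome shares with each of the others. The k-mer length, the dictionary standing in for the indexed table, and the toy genomes are all assumptions made for illustration; the real program runs its look-ups against a database with tuned indices and synchronised reader threads.

    from collections import defaultdict

    K = 25  # assumed k-mer length; the real value is not given above

    def build_kmer_index(genomes):
        """Map each k-mer to the set of genomes it occurs in."""
        index = defaultdict(set)
        for name, seq in genomes.items():
            for i in range(len(seq) - K + 1):
                index[seq[i:i + K]].add(name)
        return index

    def shared_kmer_counts(query_name, query_seq, index):
        """Count k-mers the query genome shares with each other genome."""
        shared = defaultdict(int)
        for i in range(len(query_seq) - K + 1):
            for other in index.get(query_seq[i:i + K], ()):
                if other != query_name:
                    shared[other] += 1
        return shared

    # Toy sequences; real inputs are complete genomes of a few Mbp each.
    genomes = {"genomeA": "ACGT" * 50, "genomeB": "ACGTACGTTT" * 20}
    index = build_kmer_index(genomes)
    print(shared_kmer_counts("genomeA", genomes["genomeA"], index))

Tabulating these shared counts for every pair of genomes gives exactly the high-level relatedness view mentioned above.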
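Because the note above observes that these databases and queries partition well, here is a minimal map-reduce-style sketch of the all-against-all comparison, with the map and reduce phases written as plain Python functions. The shuffle step is simulated locally; the framework plumbing (Hadoop or similar) and the k-mer representation are assumed, not taken from the text.

    from collections import defaultdict
    from itertools import combinations

    K = 25  # assumed k-mer length

    def map_phase(genomes):
        # Each mapper emits (k-mer, genome) pairs; the k-mer space
        # partitions naturally across machines, which is what makes
        # this computation a good map-reduce candidate.
        for name, seq in genomes.items():
            for i in range(len(seq) - K + 1):
                yield seq[i:i + K], name

    def reduce_phase(grouped):
        # Each reducer sees one k-mer's genome set and credits
        # every pair of genomes that shares that k-mer.
        pair_counts = defaultdict(int)
        for names in grouped.values():
            for a, b in combinations(sorted(names), 2):
                pair_counts[(a, b)] += 1
        return pair_counts

    genomes = {"genomeA": "ACGT" * 50, "genomeB": "ACGTACGTTT" * 20}

    # Simulate the shuffle that a real framework would provide.
    grouped = defaultdict(set)
    for kmer, name in map_phase(genomes):
        grouped[kmer].add(name)

    print(reduce_phase(grouped))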
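For the human-genome challenge, one plausible structuring - sketched here only to make the question concrete, not a design from the note above - is to store a reference genome once and each individual as a set of variants against it, so that population-scale questions become queries rather than string scans. The schema and the case-versus-control query below are illustrative assumptions throughout.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE person   (person_id INTEGER PRIMARY KEY,
                               population TEXT, has_disease INTEGER);
        CREATE TABLE variant  (variant_id INTEGER PRIMARY KEY,
                               chromosome TEXT, position INTEGER,
                               ref_base TEXT, alt_base TEXT);
        CREATE TABLE genotype (person_id INTEGER REFERENCES person,
                               variant_id INTEGER REFERENCES variant,
                               copies INTEGER);  -- 0, 1 or 2 copies
        CREATE INDEX ix_geno ON genotype (variant_id, person_id);
    """)

    # Which variants differ most in frequency between people with
    # and without a given disease?
    query = """
        SELECT v.chromosome, v.position,
               AVG(CASE WHEN p.has_disease = 1 THEN g.copies END) AS case_freq,
               AVG(CASE WHEN p.has_disease = 0 THEN g.copies END) AS ctrl_freq
        FROM genotype g
        JOIN person  p ON p.person_id  = g.person_id
        JOIN variant v ON v.variant_id = g.variant_id
        GROUP BY v.variant_id
        ORDER BY ABS(case_freq - ctrl_freq) DESC
    """
    for row in db.execute(query):
        print(row)

Whether a relational layout like this scales to thousands of 6 giga-basepair genomes is precisely the sort of question I would like to put to the HPTS community.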