Xtremely Large File Systems for the small collaborative world

Arun Jagatheesan, DICE Research, SDSC

Even though the "file" as we know it remains the same, the "filesystem" that manages the files keeps changing. These changes are driven not only by the scale or size of the files to be stored, but by fundamental differences in what a filesystem is expected to be in the future. In this (proposed) talk, we will look at a current use case that is pushing the envelope of filesystems, and at our current solution.

The LSST project is expected to manage 200+ petabytes of replicated data, distributed across several countries. LSST is an optical telescope being constructed in Chile through federal and private funding sources. The LSST telescope has a 3-billion-pixel camera that will capture images of the sky for more than 10 years of its initial operation. The images will be stored as files. The system that stores and manages this large number of files will provide features that distinguish it from a regular "file system". Some of the expected requirements and features of this proposed "file system" include:

* Include heterogeneous storage resources such as high-speed disks, network storage, and archival storage, from multiple partners located in different parts of the world, as part of a logical storage pool that places files based on access patterns and storage policies.

* Reconcile the conflicting needs of consistency and distribution: while allowing any storage resource from any partner country to participate in the LSST collaboration (in a peer-to-peer manner), maintain centralized consistency of all files in the file-tree (logical namespace).

* Manage the lifecycle of the files: ingest the images created by the telescope in Chile, archive a replica in the Chilean data center, create and transfer another replica for processing in a US data center, and archive a geographically distant replica in the US. All of these data transfers have to take place automatically, across the whole system, based on replication policies.

* Provide automatic selection of the appropriate replica of a file, transparent to the user. In addition, allow users to discover files by querying metadata (apart from the ability to traverse the file-tree using traditional directories). A minimal sketch of the replication-policy and replica-selection ideas appears at the end of this abstract.

Clearly, projects like these reflect an emerging trend in enterprise computing rather than an isolated problem in scientific data management. Global companies will face similar problems when they are required to serve large amounts of data from multiple data centers around the world, or through the much-hyped cloud providers. The talk will provide an overview of the LSST requirements and our solution using database and grid technologies.
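
To make the replication-policy and replica-selection requirements concrete, the sketch below shows one way a logical namespace could track replicas against a placement policy and pick a copy transparently for a client. It is a minimal, purely illustrative Python sketch: the names (StorageResource, LogicalFile, REQUIRED_COPIES, select_replica) and the specific policy are assumptions made here for illustration, not the actual LSST or SDSC software.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class StorageResource:
        name: str   # e.g. "chile-archive" (hypothetical resource name)
        site: str   # e.g. "CL" or "US"
        kind: str   # "disk", "network", or "archive"

    @dataclass
    class Replica:
        resource: StorageResource
        path: str   # physical location of this copy

    @dataclass
    class LogicalFile:
        logical_path: str                      # position in the shared file-tree
        replicas: List[Replica] = field(default_factory=list)

    # Hypothetical policy: one archival copy in Chile, one archival copy in
    # the US, and one disk copy in the US for processing.
    REQUIRED_COPIES = [("CL", "archive"), ("US", "archive"), ("US", "disk")]

    def missing_copies(f: LogicalFile):
        """Return the (site, kind) pairs the policy still requires for this file."""
        have = {(r.resource.site, r.resource.kind) for r in f.replicas}
        return [need for need in REQUIRED_COPIES if need not in have]

    def select_replica(f: LogicalFile, client_site: str) -> Replica:
        """Pick a replica transparently: prefer a disk copy at the client's
        site, then any copy at the client's site, then any other copy."""
        if not f.replicas:
            raise FileNotFoundError(f.logical_path)
        return min(
            f.replicas,
            key=lambda r: (r.resource.site != client_site, r.resource.kind != "disk"),
        )

    # Example: a file ingested in Chile with a processing copy already in the US.
    chile_tape = StorageResource("chile-archive", "CL", "archive")
    us_disk = StorageResource("us-processing", "US", "disk")
    img = LogicalFile("/lsst/raw/night001/img042.fits",
                      [Replica(chile_tape, "/tape/img042.fits"),
                       Replica(us_disk, "/scratch/img042.fits")])
    print(missing_copies(img))             # [("US", "archive")] -> schedule a transfer
    print(select_replica(img, "US").path)  # /scratch/img042.fits

In the actual system, policy evaluation, metadata queries, and the resulting data transfers would be carried out by the data-grid middleware and its catalog rather than by client code; the sketch only illustrates the kind of decision being automated.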