[A number of cloud organizations are offering free hosting for open scientific datasets. If this trend continues to develop, it will create further impetus for “open” data resulting from government-funded research, and it will also affect the architecture of research networks and the need for large physical scientific computational facilities. Two of the challenges facing researchers in the use of clouds or virtualization are the high cost of storage and the time it takes to ship large datasets over the network. Free storage and collocating large datasets with the computational cloud eliminate both challenges. Here are some pointers to initiatives in this area. Thanks to Richard Ackerman’s blog for this pointer – BSA]
Google provides free storage for datasets
Under Project Palimpsest, Google will be providing free storage and public access to large scientific data sets in what could be a major data organization challenge.
The storage would fill a major need for scientists who want to openly share their data, and would give citizen scientists access to an unprecedented amount of data to explore. For example, two planned datasets are the full 120 terabytes of Hubble Space Telescope data and the images from the Archimedes Palimpsest, the 10th-century manuscript that inspired the Google dataset storage project.
The challenge would be in how Google is able to present the data to the public. Google's acquisition of Trendalyzer would come in really handy here. The data source is open to the public, which means that additions can be made to it as well.
There is also a presentation available at SearchEngineJournal on the talk delivered last May. And if you have datasets too large to upload, Google is providing 3TB disk arrays so that the whole file system for the dataset can be shipped.
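To make the shipping trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It uses the 120 terabyte Hubble figure and the 3TB array size mentioned above. The link speeds, the decimal definition of a terabyte, and the function names are assumptions chosen for illustration, not figures from Google.

```python
# Rough estimate only: link speeds and the decimal terabyte are assumptions.
TB = 10**12  # bytes in one decimal terabyte (assumed unit)


def transfer_days(size_bytes, link_bits_per_sec):
    """Days needed to move size_bytes over a fully utilized link."""
    seconds = size_bytes * 8 / link_bits_per_sec
    return seconds / 86400  # seconds per day


def arrays_needed(size_bytes, array_bytes):
    """Number of disk arrays needed to hold size_bytes (ceiling division)."""
    return -(-size_bytes // array_bytes)


hubble = 120 * TB  # Hubble dataset size quoted in the post

print("3TB arrays needed:", arrays_needed(hubble, 3 * TB))
for label, speed in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)]:
    print(f"{label}: {transfer_days(hubble, speed):.1f} days")
```

Under these assumptions the Hubble data fills 40 arrays and would take about 11 days on a saturated 1 Gbit/s link, or well over three months at 100 Mbit/s, which is why shipping disks remains attractive for datasets of this size.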
Talis Connected Commons
Talis Connected Commons is about fostering the Linked Data community, by providing a rich hosting service:
For qualifying data sets, Talis will provide, through the Talis Platform:
• Free hosting of up to 50 million RDF triples and 10Gb of content
• Access to data access services that operate on that data, including data retrieval and text search
• Free access to a public SPARQL endpoint for each dataset.
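As an illustration of what a public SPARQL endpoint allows, the sketch below builds a query URL following the standard SPARQL protocol, where the query text is passed in a `query` parameter of an HTTP GET request. The endpoint address is a hypothetical placeholder, not a real Talis Platform URL, and the helper name is made up for this example.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the address published for a real dataset.
ENDPOINT = "https://example.org/sparql"

# A generic query that returns the first ten triples in the store.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_query_url(endpoint, query):
    """Return a GET URL that carries the SPARQL query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could be fetched with any HTTP client; the request itself is left out here because the endpoint is only a placeholder.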
I asked Leigh how this fits with the Talis Project Xiphos initiative, and he explained that Xiphos is a more focussed initiative around "data in the education, library and publishing sectors", whereas Connected Commons is about any kind of data.
Talis, like Amazon, understands that a modern business is about fostering an ecosystem, a combination of shared data and services that can be used as a platform for software development and business development.
Amazon Public Datasets service
Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications.
Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. Users can also discuss best practices and solutions in the dedicated Public Data Sets forum.
By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.