Storage, Toolspace, Access and analytics for Big Data Empowerment
- Website: https://datastage.io/
- Started: August 1, 2018
- Duration: 12 months
- Project reference: https://projectreporter.nih.gov/project_info_details.cfm?aid=9732880&icde=44591431
The NHLBI’s DataSTAGE project will develop innovative computing solutions that meet the needs of the NHLBI and its research community, building on the cloud-based infrastructure of the NIH Data Commons, which kicked off in 2017. NHLBI’s DataSTAGE is a cloud-based platform, or technical framework, for tools, applications, and workflows. DataSTAGE provides secure workspaces to share, store, cross-link, and compute on large sets of data generated from biomedical and behavioral research.
Seven Bridges Genomics and the Elsevier Mendeley Data team collaborate on the DataSTAGE project as part of Team Xenon to continue development of FAIR4Cures, which was initiated as an NIH Data Commons pilot in the Data Commons Pilot Phase Consortium (DCPPC) and is now carried forward as part of DataSTAGE (see also the NHLBI DataSTAGE project site).
FAIR4Cures aims to build a FAIR genomics workflow execution and data management solution. The core components are:
- Common Workflow Language for interoperable definitions of the computational workflows
- Seven Bridges Platform for executing scalable workflows
- Research Object for packaging the result dataset, including the executed workflow, as BDBags (see the packaging sketch after this list)
- Mendeley Data for archiving/publishing the research objects
- GUID Broker for assigning and resolving permanent identifiers of workflows, datasets and research objects, using DOIs and MinIDs
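As a rough illustration of the packaging component, the sketch below uses the bagit-python library to bundle a results directory into a BagIt bag with checksums. A full BDBag additionally carries Research Object metadata and remote file references, which this minimal sketch does not cover; the directory path and metadata values are hypothetical.

```python
# Minimal sketch: turn a directory of workflow outputs into a BagIt bag
# in place (BDBags build on BagIt). Requires `pip install bagit`.
# The directory path and metadata values below are hypothetical.
import bagit

bag = bagit.make_bag(
    "workflow-run-results/",          # hypothetical directory of workflow outputs
    bag_info={
        "Source-Organization": "Example Lab",
        "External-Description": "Outputs of a CWL workflow run",
    },
    checksums=["sha256"],             # fixity information for every payload file
)

bag.validate()                        # verify manifests and checksums
print(bag.info)
```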
eScience Lab contribution
In DataSTAGE, the eScience Lab is a subcontractor to Mendeley Data, developing the Research Object Composer and consulting on Research Object structure, building on our existing collaboration with Seven Bridges in the Common Workflow Language project.
The Research Object Composer acts as the bridge between the Seven Bridges workflow execution platform and the Mendeley Data repository. Instances of the composer expose a REST API that the workflow platform and other clients use to incrementally build a research object according to the slots defined in a specified profile (e.g. “prospective workflow run”), validate it against the underlying JSON and SHACL schemas, and build a BDBag and submit the archived RO to the repository. A Jupyter Notebook demonstrates how the RO Composer API can be used by clients.
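As a loose illustration of this kind of client interaction (not the actual RO Composer API; the base URL, endpoint paths, and payload fields here are all hypothetical, and the Jupyter Notebook remains the authoritative usage example), a client might fill slots incrementally and then deposit the finished research object roughly as follows:

```python
# Hypothetical sketch of a client talking to an RO Composer instance.
# Endpoint paths, field names, and the base URL are illustrative only;
# see the project's Jupyter Notebook for real API usage.
import requests

BASE = "https://composer.example.org"   # hypothetical composer instance

# Start a research object that follows a named profile.
ro = requests.post(
    f"{BASE}/researchobjects",
    json={"profile": "prospective-workflow-run"},
).json()
ro_id = ro["id"]

# Incrementally fill the slots the profile defines.
requests.put(
    f"{BASE}/researchobjects/{ro_id}/content/workflow",
    json={"url": "https://example.org/workflows/alignment.cwl"},
)
requests.put(
    f"{BASE}/researchobjects/{ro_id}/content/results",
    json={"url": "https://example.org/outputs/run-42/"},
)

# Validation runs against the profile's JSON/SHACL schemas; deposition
# packages the RO as a BDBag and submits it to the repository.
requests.post(f"{BASE}/researchobjects/{ro_id}/deposit")
```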
Additional responsibilities we are exploring for the RO Composer include handling snapshotting and registration of individual data files and workflows using MinIDs and checksums, as well as tracking the evolution of Research Objects built using the composer.
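The essential ingredient for such snapshotting is a stable checksum per file that a permanent identifier such as a MinID can be bound to. A minimal sketch (the identifier minting and registration calls are omitted, and the file path is made up):

```python
# Sketch: compute a SHA-256 checksum for a file so a permanent
# identifier (e.g. a MinID) can be bound to that exact snapshot.
# The file path is hypothetical; minting/registration is not shown.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large workflow outputs fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

checksum = sha256_of(Path("outputs/variants.vcf.gz"))  # hypothetical output file
print(checksum)  # this value would accompany the identifier registration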
The RO Composer is generic and can build research objects according to any Research Object profile, so we are also planning to extend it to build BioCompute Objects for a PrecisionFDA challenge, as well as to build more specific profiles for RO-Crate.
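To give a flavour of what an RO-Crate-targeted profile would produce, the sketch below writes a minimal RO-Crate-style metadata file. The spec version, identifiers, and file names are illustrative only and do not reflect the project's actual profiles.

```python
# Sketch of a minimal RO-Crate-style metadata file, to illustrate the
# kind of profile the composer could target. Identifiers, file names,
# and the spec version are illustrative, not taken from the project.
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example workflow run",          # hypothetical
            "hasPart": [{"@id": "results.csv"}],
        },
        {"@id": "results.csv", "@type": "File"},     # hypothetical output
    ],
}

print(json.dumps(crate, indent=2))
```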