RO-Crate for Biodiversity Genomics

Biodiversity Genomics RO-Crate Profile

UNDER CONSTRUCTION

This is a first draft currently in active development. Everything is subject to change.

Overview

The aim of this profile is to provide a full description of the provenance of biodiversity genomics data. This means capturing and connecting the different steps of biodiversity genomics research, including:

These steps represent a mixture of physical and computational processes. They can be broadly grouped into three main stages:

At each stage, accession numbers, authors, affiliations, and additional metadata are collected. The stages are connected through these accession numbers, and the steps within them are connected as “actions” which represent the processes and workflows used.

This profile takes heavy inspiration from the Process Run Crate profile, specifically in how processes are connected to inputs, outputs, and tools through the use of CreateAction. As Process Run Crate is intended only for describing computational processes, we aim to generalize its approach to work in additional contexts. The Provenance of entities page of the RO-Crate specification is also relevant here.

Notes before reading

Note the distinction between the Bioschemas types LabProtocol, a sequence of tasks and operations executed to perform experimental research, and LabProcess, the specific application of a LabProtocol to some input (biological material or data) to produce some output (biological material or data). This is analogous to the distinction between prospective and retrospective provenance presented in Workflow Run Crate.

The spreadsheet "RO-Crate Bioschema mapping" is the original reference for which metadata should be mapped to which terms in RO-Crate, and in Bioschemas in particular. Where information is omitted from this draft, it may be present in that spreadsheet (though eventually all of the spreadsheet information should be in this profile).

The ISA (Investigation, Study, Assay) metadata framework and ISA RO-Crate profile may map onto this draft profile cleanly (with the root dataset as the Investigation and the stages as Studies). That profile is already supported by some existing platforms and projects (e.g. FAIRDOM-SEEK, FAIR Data Station, ARC framework). A future version of this profile may incorporate or inherit requirements from the ISA profile in order to improve interoperability.

The Common Provenance Model (CPM) and the CPM RO-Crate profile also offer a path to more detailed, distributed provenance metadata. Incorporating CPM would better support the step-by-step construction of the provenance chain during the research pipeline, rather than at the end of it. For now, this profile focuses mainly on the publication use case, where all provenance information is collected from disparate database sources once the analysis is complete and ready to publish. A future version of this profile may incorporate or inherit requirements from the CPM profile.

Core Metadata

The crate should have all metadata in this section.

Root Data Entity

The root data entity should include the following properties:

Species/taxon

A species should be represented using the Bioschemas Taxon type (https://bioschemas.org/Taxon). Its properties name, scientificName and taxonRank should be present.
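A minimal sketch of such an entity, written as a Python dict that serializes to a JSON-LD entity for the crate's @graph. The helper name make_taxon, the local @id and the example species are illustrative assumptions, not requirements of this profile; a resolvable taxonomy URI could be used as the @id instead.

```python
import json
import uuid


def make_taxon(name: str, scientific_name: str, taxon_rank: str) -> dict:
    """Build a Bioschemas Taxon entity (hypothetical helper, for illustration)."""
    return {
        # Assumption: a local identifier is used here.
        "@id": f"#taxon-{uuid.uuid4()}",
        "@type": "Taxon",
        "name": name,
        "scientificName": scientific_name,
        "taxonRank": taxon_rank,
    }


taxon = make_taxon("Eurasian otter", "Lutra lutra", "species")
print(json.dumps(taxon, indent=2))
```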

Stages

At all stages, there may be multiple samples or data entities. These can be grouped together using an entity of type Collection, with hasPart linking to the individual entities. A Collection may be used in any object or result property on a CreateAction instead of an individual entity, provided that the entities in its hasPart are of the expected type for that process.
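The grouping could look like the following sketch, which builds a Collection over two BioSample entities and references it from the object of a CreateAction. The helper names, entity names and identifiers are hypothetical; parts_have_type is an illustrative check of the "expected type" condition, not something defined by this profile.

```python
import uuid


def make_collection(name: str, members: list) -> dict:
    """Group entities under a Collection via hasPart (hypothetical helper)."""
    return {
        "@id": f"#collection-{uuid.uuid4()}",
        "@type": "Collection",
        "name": name,
        "hasPart": [{"@id": member["@id"]} for member in members],
    }


def parts_have_type(collection: dict, graph: list, expected_type: str) -> bool:
    """Check that every entity referenced in hasPart has the expected @type."""
    by_id = {entity["@id"]: entity for entity in graph}
    for part in collection["hasPart"]:
        types = by_id[part["@id"]]["@type"]
        if isinstance(types, str):
            types = [types]
        if expected_type not in types:
            return False
    return True


samples = [
    {"@id": f"#sample-{uuid.uuid4()}", "@type": "BioSample", "name": f"Tissue sample {i}"}
    for i in (1, 2)
]
collection = make_collection("Tissue samples", samples)

# The Collection stands in for the individual samples as the input of a process.
action = {
    "@id": f"#process-{uuid.uuid4()}",
    "@type": "CreateAction",
    "name": "Example process",
    "object": {"@id": collection["@id"]},
}

graph = samples + [collection, action]
print(parts_have_type(collection, graph, "BioSample"))
```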

Most processes and objects are optional, because not all of this data may be collected or machine-retrievable in all cases. Where there are gaps, placeholder entities can be used. These should have an @id with a local identifier*, and a name and description explaining what the placeholder represents.

*Local identifiers should start with # and include a UUID to ensure uniqueness. The UUID helps to avoid duplicate entities when many RO-Crates are combined into a knowledge graph.
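As a sketch of the identifier convention, the snippet below generates a local identifier and wraps it in a placeholder entity. The function names, the "placeholder" prefix and the example text are assumptions made for illustration; only the leading # and the UUID come from the text above.

```python
import uuid


def local_id(prefix: str = "placeholder") -> str:
    """Return a local identifier: starts with '#' and embeds a random UUID."""
    return f"#{prefix}-{uuid.uuid4()}"


def make_placeholder(entity_type: str, name: str, description: str) -> dict:
    """Build a placeholder entity for a step whose metadata is unavailable."""
    return {
        "@id": local_id(),
        "@type": entity_type,
        "name": name,
        "description": description,
    }


placeholder = make_placeholder(
    "BioSample",
    "Unrecorded intermediate sample",
    "Placeholder for a sample whose metadata could not be retrieved.",
)
print(placeholder["@id"])
```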

Stage: Sample Collection and Preservation

At minimum, there MUST be an entity representing a biobanked sample. This entity should have the Bioschemas BioSample type (https://bioschemas.org/BioSample). There MAY be multiple entities representing one sample each.

Additional provenance information for the biobanked sample MAY be provided. This SHOULD be in the form of a chain of BioSample entities connected by LabProcess entities (a subclass of Action in schema.org). For example:
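One way such a chain could be encoded is sketched below in Python. It assumes that LabProcess uses the object and result properties it inherits from schema.org Action to point at its input and output, mirroring how CreateAction is used in Process Run Crate. The entity names are invented, and the local identifiers are stand-ins: real BioSample entities should follow the @id rules of this profile.

```python
import json
import uuid


def new_id(prefix: str) -> str:
    """Local identifier with a UUID (illustrative helper)."""
    return f"#{prefix}-{uuid.uuid4()}"


collected = {
    "@id": new_id("sample"),
    "@type": "BioSample",
    "name": "Specimen collected in the field",
}
biobanked = {
    "@id": new_id("sample"),
    "@type": "BioSample",
    "name": "Biobanked tissue sample",
}
preservation = {
    "@id": new_id("process"),
    "@type": "LabProcess",
    "name": "Preservation and biobanking",
    # Assumption: object = input sample, result = output sample.
    "object": {"@id": collected["@id"]},
    "result": {"@id": biobanked["@id"]},
}

graph = [collected, preservation, biobanked]
print(json.dumps(graph, indent=2))
```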

BioSample entity metadata

Each BioSample entity’s @id MUST be either:

If the @id resolves to another RO-Crate, additional metadata should be added according to Referencing other RO-Crates.

A BioSample entity MUST include the following properties:

If the @id does not resolve to another RO-Crate, the following additional metadata SHOULD be added to describe the sample:

LabProcess entity metadata

LabProcess entities MUST have the following properties:

LabProcess entities SHOULD have the following properties:

Stage: Wet lab and sequencing

At minimum, there MUST be an entity representing the sequenced data. This SHOULD be a data entity representing the data itself (so it should have type File or Dataset; the data may be web-based). There MAY be multiple entities representing multiple data files.

Additional provenance information for the sequenced data MAY be provided. This SHOULD be in the form of BioSamples connected to File/Dataset entities by LabProcess and CreateAction entities, and the chain SHOULD be connected to the biobanked sample from the previous stage. For example:

LabProcess and BioSample entities should be described as above. Computational processes should be described as CreateAction entities following the Process Run Crate or Workflow Run Crate patterns.
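To illustrate how the chain connects back to the biobanked sample, the sketch below builds a small graph (sample, extraction, sequencing, a computational step) and walks it backwards from a data file. The identifiers, file names and the trace_back helper are hypothetical; short fixed ids are used for readability, whereas real local identifiers should include a UUID. It again assumes that object and result carry the inputs and outputs of both LabProcess and CreateAction.

```python
def as_list(value):
    """Normalise a JSON-LD property value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def trace_back(graph: list, target_id: str) -> list:
    """Return the ids of all entities upstream of target_id, nearest first."""
    actions = [entity for entity in graph if "result" in entity]
    upstream, queue, seen = [], [target_id], {target_id}
    while queue:
        current = queue.pop(0)
        for action in actions:
            if any(r["@id"] == current for r in as_list(action["result"])):
                for obj in as_list(action.get("object")):
                    if obj["@id"] not in seen:
                        seen.add(obj["@id"])
                        upstream.append(obj["@id"])
                        queue.append(obj["@id"])
    return upstream


graph = [
    {"@id": "#biobanked-sample", "@type": "BioSample", "name": "Biobanked sample"},
    {"@id": "#extraction", "@type": "LabProcess", "name": "DNA extraction",
     "object": {"@id": "#biobanked-sample"}, "result": {"@id": "#dna-extract"}},
    {"@id": "#dna-extract", "@type": "BioSample", "name": "DNA extract"},
    {"@id": "#sequencing", "@type": "LabProcess", "name": "Sequencing run",
     "object": {"@id": "#dna-extract"}, "result": {"@id": "raw_reads.fastq.gz"}},
    {"@id": "raw_reads.fastq.gz", "@type": "File", "name": "Raw reads"},
    {"@id": "#filtering", "@type": "CreateAction", "name": "Read filtering",
     "object": {"@id": "raw_reads.fastq.gz"}, "result": {"@id": "filtered_reads.fastq.gz"}},
    {"@id": "filtered_reads.fastq.gz", "@type": "File", "name": "Filtered reads"},
]

print(trace_back(graph, "filtered_reads.fastq.gz"))
```

The last id returned is the biobanked sample, which shows that the sequenced data is connected to the previous stage.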

Stage: Computational Analysis

At minimum, there MUST be an entity representing the analysis results. This SHOULD be a data entity representing the data itself (so it should have type File or Dataset; the data may be web-based). There MAY be multiple entities representing multiple data files.

Additional provenance information for the analysis results MAY be provided. This SHOULD be in the form of File/Dataset entities connected by CreateAction entities, and the chain SHOULD be connected to the sequenced data from the previous stage. For example:

This stage should follow the style of the Workflow Run Crate profile, with the following exceptions:

It is possible to make an RO-Crate that is fully compliant with both this profile and the Workflow Run Crate profile, in which case it can be declared as conforming to both profiles. This may be appropriate where the computational analysis is the primary focus of the crate (regardless of how much provenance is provided for the earlier stages).
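A sketch of how dual conformance might be declared, assuming the usual RO-Crate convention of listing profile URIs in conformsTo on the root data entity. The URI for this profile is a made-up placeholder, since the draft does not state one, and the Workflow Run Crate URI and version shown are an assumption to be checked against that profile, which may also require further conformsTo entries.

```python
import json

# Hypothetical placeholder: this draft profile has no published URI here.
THIS_PROFILE = "https://example.org/biodiversity-genomics-profile"
# Assumed URI and version of the Workflow Run Crate profile.
WORKFLOW_RUN_CRATE = "https://w3id.org/ro/wfrun/workflow/0.5"

root = {
    "@id": "./",
    "@type": "Dataset",
    "name": "Example biodiversity genomics crate",
    "conformsTo": [
        {"@id": THIS_PROFILE},
        {"@id": WORKFLOW_RUN_CRATE},
    ],
}

print(json.dumps(root, indent=2))
```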

Any computational workflow files included directly in the RO-Crate should follow the style of the Workflow RO-Crate profile, with the following exceptions:

Multiple analyses may be chained together.

Multiple processes per stage

It is common for multiple samples to feed into a single computational analysis, or for one analysis output to be the input of multiple secondary analyses. As such, it is permitted to have multiple “core” entities for each stage, each with their own provenance chain. Between one stage and the next, the core entities may be connected in one-to-one, many-to-one, one-to-many, or many-to-many relationships.
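As an illustration, the sketch below models a many-to-one case, with two read files feeding one assembly, and classifies an action by counting its inputs and outputs. The file names and the cardinality helper are hypothetical, and object and result are again assumed to hold the inputs and outputs. A Collection used as an object counts as a single reference here; expanding its hasPart is left out for brevity.

```python
def as_list(value):
    """Normalise a JSON-LD property value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def cardinality(action: dict) -> str:
    """Describe an action as one-to-one, many-to-one, one-to-many or many-to-many."""
    n_in = len(as_list(action.get("object")))
    n_out = len(as_list(action.get("result")))
    if n_in == 0 or n_out == 0:
        raise ValueError("action needs at least one object and one result")
    left = "one" if n_in == 1 else "many"
    right = "one" if n_out == 1 else "many"
    return f"{left}-to-{right}"


assembly = {
    "@id": "#assembly-run",
    "@type": "CreateAction",
    "name": "Genome assembly",
    "object": [{"@id": "reads_sample_1.fastq.gz"}, {"@id": "reads_sample_2.fastq.gz"}],
    "result": {"@id": "assembly.fasta"},
}

print(cardinality(assembly))
```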