This is a first draft currently in active development. Everything is subject to change.
The aim of this profile is to provide a full description of the provenance of biodiversity genomics data. This means capturing and connecting the different steps of biodiversity genomics research, including:
These steps represent a mixture of physical and computational processes. They can be broadly grouped into three main stages:
At each stage, accession numbers, authors, affiliations, and additional metadata are collected. The stages are connected through these accession numbers, and the steps within them are connected as “actions” which represent the processes and workflows used.
This profile takes heavy inspiration from the Process Run Crate profile, specifically in how processes are connected to inputs, outputs, and tools through the use of CreateAction. As Process Run Crate is intended only for describing computational processes, we aim to generalize its approach to work in additional contexts. The Provenance of entities page of the RO-Crate specification is also relevant here.
Note the distinction between Bioschemas LabProtocol (a sequence of tasks and operations executed to perform experimental research) and LabProcess (the specific application of a LabProtocol to some input (biological material or data) to produce some output (biological material or data)). This is analogous to the prospective and retrospective provenance ideas presented in Workflow Run Crate.
The RO-Crate Bioschema mapping spreadsheet is the original reference for which metadata should be mapped to which terms in RO-Crate, and particularly in Bioschemas. Where information is omitted from this draft, it may be present in that spreadsheet (though eventually all of the spreadsheet information should be in this profile).
The ISA (Investigation, Study, Assay) metadata framework and ISA RO-Crate profile may map onto this draft profile cleanly (with the root dataset as the Investigation and the stages as Studies). That profile is already supported by some existing platforms and projects (e.g. FAIRDOM-SEEK, FAIR Data Station, ARC framework). A future version of this profile may incorporate or inherit requirements from the ISA profile in order to improve interoperability.
The Common Provenance Model (CPM) and the CPM RO-Crate profile also offer a path to more detailed, distributed provenance metadata. Incorporating CPM would better support the step-by-step construction of the provenance chain during the research pipeline, rather than at the end of it. For now, this profile focuses mainly on the publication use case, where all provenance information is collected from disparate database sources once the analysis is complete and ready to publish. A future version of this profile may incorporate or inherit requirements from the CPM profile.
The crate should have all metadata in this section.
The root data entity should include the following properties:
- hasPart: must include the data objects included within the crate.
- mentions: must include all the processes described (i.e. all the CreateActions)
- about: links to a Taxon entity for the species being described (same as taxonomicRange)
- taxonomicRange: links to a Taxon entity for the species being described (same as about)
- identifier: should include a BioProject identifier if one exists for the project
- mainEntity: must reference one of the following three types of entities according to the focus of the crate:
  - BioSamples representing the sample(s)
  - Files or Datasets representing the raw genetic data
  - Files or Datasets representing the primary output(s) of the analysis
- funder: should reference an Organization representing the project that funded the work. See Funding and grants from the core RO-Crate spec.

A species should be represented using the Bioschemas Taxon type (https://bioschemas.org/Taxon). Its properties name, scientificName and taxonRank should be present.
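The root data entity and Taxon requirements above can be sketched as plain Python dictionaries that serialise to JSON-LD. This is only an illustration: the file name, the local identifiers, and the helper missing_root_properties are assumptions made for the example and are not defined by the profile. The conditional identifier property (a BioProject identifier) is left out because no real accession is available here.

```python
import uuid

# Properties the text above expects on the root data entity.
# "identifier" is omitted because it only applies when a BioProject exists.
EXPECTED_ROOT_PROPERTIES = [
    "hasPart", "mentions", "about", "taxonomicRange", "mainEntity", "funder",
]

# Hypothetical Taxon entity; a real crate would normally use a resolvable @id.
taxon = {
    "@id": f"#taxon-{uuid.uuid4()}",
    "@type": "Taxon",
    "name": "Culex laticinctus",
    "scientificName": "Culex laticinctus",
    "taxonRank": "species",
}

# Hypothetical identifiers for the analysis output, the process and the funder.
output_file = {"@id": "assembly.fasta", "@type": "File"}
action_id = f"#action-{uuid.uuid4()}"
funder_id = f"#funder-{uuid.uuid4()}"

root = {
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [{"@id": output_file["@id"]}],
    "mentions": [{"@id": action_id}],
    # about and taxonomicRange point at the same Taxon entity.
    "about": {"@id": taxon["@id"]},
    "taxonomicRange": {"@id": taxon["@id"]},
    # Here the crate is focused on the primary output of the analysis.
    "mainEntity": {"@id": output_file["@id"]},
    "funder": {"@id": funder_id},
}


def missing_root_properties(entity):
    """Return the expected root properties that are absent from the entity."""
    return [prop for prop in EXPECTED_ROOT_PROPERTIES if prop not in entity]
```

Calling missing_root_properties(root) returns an empty list for the entity above, and the full list of expected properties for a bare {"@id": "./"} entity.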
At all stages, there may be multiple samples or data entities - these can be grouped together using an entity of type Collection with hasPart linking to the individual entities. A Collection may be used in any object or result property on a CreateAction instead of an individual entity, provided that the entities in its hasPart are of the expected type for that process.
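As a rough sketch of this grouping, the snippet below builds a Collection over two data entities and checks that every member has the type a process expects. The file names and the helpers make_collection and members_have_type are hypothetical and exist only for this example.

```python
import uuid


def types_of(entity):
    """Return the entity's @type as a list, whether it is a string or a list."""
    value = entity.get("@type", [])
    return [value] if isinstance(value, str) else list(value)


def make_collection(name, members):
    """Group entities under a Collection whose hasPart references each member."""
    return {
        "@id": f"#collection-{uuid.uuid4()}",
        "@type": "Collection",
        "name": name,
        "hasPart": [{"@id": member["@id"]} for member in members],
    }


def members_have_type(collection, graph, expected_type):
    """Check that every entity referenced in hasPart carries the expected type."""
    by_id = {entity["@id"]: entity for entity in graph}
    return all(
        expected_type in types_of(by_id[ref["@id"]])
        for ref in collection["hasPart"]
    )


# Hypothetical read files grouped into one Collection.
reads = [
    {"@id": "reads_1.fastq.gz", "@type": "File"},
    {"@id": "reads_2.fastq.gz", "@type": "File"},
]
collection = make_collection("Paired-end reads", reads)

# The Collection stands in for the individual files in the action's object.
action = {
    "@id": f"#action-{uuid.uuid4()}",
    "@type": "CreateAction",
    "object": [{"@id": collection["@id"]}],
}
graph = reads + [collection, action]
```

A check such as members_have_type(collection, graph, "File") could be run before a Collection is accepted as the object or result of a process that expects File entities.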
Most processes and objects are optional as not all this data may be collected or machine-retrievable in all cases. Where there are gaps, placeholder entities can be used - these should have an @id with a local identifier* and a name and description explaining what the placeholder is representing.
*local identifiers should start with # and include a UUID to ensure uniqueness. The UUID is useful to avoid duplicate entities when many RO-Crates are combined into a knowledge graph.
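A placeholder entity of this kind could be generated as follows. The function name make_placeholder and the example wording are assumptions for illustration; only the # prefix, the UUID, and the name and description properties come from the text above.

```python
import uuid


def make_placeholder(entity_type, name, description):
    """Build a placeholder entity with a local, UUID-based @id."""
    return {
        # Local identifiers start with "#" and include a UUID for uniqueness.
        "@id": f"#{uuid.uuid4()}",
        "@type": entity_type,
        "name": name,
        "description": description,
    }


# Hypothetical gap: the original specimen has no database record.
placeholder = make_placeholder(
    "BioSample",
    "Original specimen (placeholder)",
    "Stands in for the collected specimen, for which no record is available.",
)
```

Because each call draws a fresh random UUID, two placeholders created in different crates will not share an @id when the crates are later merged into one graph.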
At minimum, there MUST be an entity representing a biobanked sample. This entity should have the Bioschemas BioSample type (https://bioschemas.org/BioSample). There MAY be multiple entities representing one sample each.
Additional provenance information for the biobanked sample MAY be provided. This SHOULD be in the form of a chain of BioSample entities connected by LabProcess entities (a subclass of Action in schema.org). For example:
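A minimal sketch of such a chain, written as Python dictionaries, is given here. It assumes one upstream specimen (a placeholder with a local identifier) that a single LabProcess turns into the biobanked sample. The protocol entity, the process name, and the helper upstream_of are hypothetical; the sample identifier and name reuse the example values that appear elsewhere in this profile.

```python
import uuid

# Upstream specimen with no database record, so it gets a local identifier.
specimen = {
    "@id": f"#{uuid.uuid4()}",
    "@type": "BioSample",
    "name": "Collected specimen (placeholder)",
}

# Biobanked sample, identified by an absolute URI.
biobanked = {
    "@id": "https://identifiers.org/ena.embl:SAMEA114402090",
    "@type": "BioSample",
    "name": "Culex laticinctus sample SAMEA114402090",
}

# Hypothetical protocol (SOP) followed by the process.
protocol = {
    "@id": f"#protocol-{uuid.uuid4()}",
    "@type": "LabProtocol",
    "name": "Example sample preparation SOP",
}

# The LabProcess links its input (object) to its output (result).
process = {
    "@id": f"#{uuid.uuid4()}",
    "@type": "LabProcess",
    "name": "Sample preparation",
    "instrument": {"@id": protocol["@id"]},
    "object": [{"@id": specimen["@id"]}],
    "result": [{"@id": biobanked["@id"]}],
}

graph = [specimen, biobanked, protocol, process]


def upstream_of(entity_id, graph):
    """Return the @ids of the inputs of every process that produced entity_id."""
    inputs = []
    for entity in graph:
        if entity.get("@type") not in ("LabProcess", "CreateAction"):
            continue
        produced = [ref["@id"] for ref in entity.get("result", [])]
        if entity_id in produced:
            inputs.extend(ref["@id"] for ref in entity.get("object", []))
    return inputs
```

Following object and result references in this way lets a consumer walk from the biobanked sample back to the specimen it came from.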
Each BioSample entity’s @id MUST be either:
https://identifiers.org/ena.embl:SAMEA114402090 not just SAMEA114402090).

If the @id resolves to another RO-Crate, additional metadata should be added according to Referencing other RO-Crates.
A BioSample entity MUST include the following properties:
- name: A human-friendly name for the sample. This SHOULD be unique within the crate. Example: “Culex laticinctus sample SAMEA114402090”

If the @id does not resolve to another RO-Crate, the following additional metadata SHOULD be added to describe the sample:
- identifier: URI, accession, or other identifier(s) of the sample in a database (BioSamples, COPO, etc). This may lead to a landing page or API endpoint with additional metadata.
- collector: person who collected the sample
- maintainer / custodian: person who is or was responsible for custody of the sample
- contributor: person who identified the sample
- locationOfOrigin: where the sample was collected
- collectionMethod and ethics: Details/locations of permits and/or ethical/legal documentation/compliance

LabProcess entities MUST have the following properties:
- @id: MUST be a unique identifier, the use of randomly generated UUIDs (type 4) is RECOMMENDED.
- instrument: SHOULD link to a LabProtocol entity for the SOP used

LabProcess entities SHOULD have the following properties:
- name
- description
- endTime: The datetime that the process was completed.
- agent: link to a Person who carried out the process
- object: link to entities representing the inputs
- result: link to entities representing the output

At minimum, there MUST be an entity representing the sequenced data. This SHOULD be a data entity representing the data itself (so should have type File or Dataset; the data may be web-based). There MAY be multiple entities representing multiple data files.
Additional provenance information for the sequenced data MAY be provided. This SHOULD be in the form of BioSamples connected to File/Dataset entities by LabProcess and CreateAction entities, and the chain SHOULD be connected to the biobanked sample from the previous stage. For example:
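One way to picture this is the sketch below: the biobanked sample is sequenced by a LabProcess into a raw reads file, and a CreateAction then turns the raw reads into processed reads. The file names, process names, tool entity, and the helper is_derived_from are all hypothetical. The helper simply checks that a data entity can be traced back to the biobanked sample through object and result links.

```python
import uuid

sample_id = "https://identifiers.org/ena.embl:SAMEA114402090"
sample = {"@id": sample_id, "@type": "BioSample",
          "name": "Culex laticinctus sample SAMEA114402090"}

# Hypothetical data entities produced during sequencing.
raw_reads = {"@id": "reads_raw.fastq.gz", "@type": "File"}
clean_reads = {"@id": "reads_clean.fastq.gz", "@type": "File"}

# Physical step: sequencing the sample (described as a LabProcess).
sequencing = {
    "@id": f"#{uuid.uuid4()}",
    "@type": "LabProcess",
    "name": "Sequencing run",
    "object": [{"@id": sample_id}],
    "result": [{"@id": raw_reads["@id"]}],
}

# Computational step: described as a CreateAction, Process Run Crate style.
tool = {"@id": f"#tool-{uuid.uuid4()}", "@type": "SoftwareApplication",
        "name": "Example read-cleaning tool"}
cleaning = {
    "@id": f"#{uuid.uuid4()}",
    "@type": "CreateAction",
    "name": "Read cleaning",
    "instrument": {"@id": tool["@id"]},
    "object": [{"@id": raw_reads["@id"]}],
    "result": [{"@id": clean_reads["@id"]}],
}

graph = [sample, raw_reads, clean_reads, sequencing, tool, cleaning]


def is_derived_from(entity_id, ancestor_id, graph):
    """Walk result -> object links backwards to see if ancestor_id is reached."""
    inputs_of = {}
    for entity in graph:
        if entity.get("@type") in ("LabProcess", "CreateAction"):
            for ref in entity.get("result", []):
                inputs_of.setdefault(ref["@id"], []).extend(
                    obj["@id"] for obj in entity.get("object", [])
                )
    stack, seen = [entity_id], set()
    while stack:
        current = stack.pop()
        if current == ancestor_id:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(inputs_of.get(current, []))
    return False
```

A check like is_derived_from(clean_reads["@id"], sample_id, graph) gives a quick way to confirm that the sequencing stage is actually connected to the biobanked sample from the previous stage.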
LabProcess entities and BioSample should be described as above. Computational processes should be described as CreateActions following the Process Run Crate or Workflow Run Crate patterns.
At minimum, there MUST be an entity representing the analysis results. This SHOULD be a data entity representing the data itself (so should have type File or Dataset; the data may be web-based). There MAY be multiple entities representing multiple data files.
Additional provenance information for the analysis results MAY be provided. This SHOULD be in the form of File/Dataset entities connected by CreateAction entities, and the chain SHOULD be connected to the sequenced data from the previous stage. For example:
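A hedged sketch of one such analysis step follows. It assumes a workflow run described as a CreateAction whose instrument is the workflow, with one data input, one parameter value captured as a PropertyValue, and one output file. The DOI, file names, parameter name, and the helper split_inputs are placeholders invented for the example.

```python
import uuid

# Hypothetical workflow entity; the DOI below is a placeholder, not a real one.
workflow = {
    "@id": "https://doi.org/10.0000/hypothetical-workflow",
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    "name": "Example assembly workflow",
}

# Sequenced data from the previous stage and the analysis output.
reads = {"@id": "reads_clean.fastq.gz", "@type": "File"}
assembly = {"@id": "assembly.fasta", "@type": "File"}

# Parameter value captured as a PropertyValue (no FormalParameter is linked).
parameter = {
    "@id": f"#param-{uuid.uuid4()}",
    "@type": "PropertyValue",
    "name": "min_contig_length",
    "value": "1000",
}

run = {
    "@id": f"#{uuid.uuid4()}",
    "@type": "CreateAction",
    "name": "Assembly workflow run",
    "instrument": {"@id": workflow["@id"]},
    "object": [{"@id": reads["@id"]}, {"@id": parameter["@id"]}],
    "result": [{"@id": assembly["@id"]}],
}

graph = [workflow, reads, assembly, parameter, run]


def split_inputs(action, graph):
    """Separate an action's inputs into data entity ids and parameter values."""
    by_id = {entity["@id"]: entity for entity in graph}
    data_ids, parameters = [], {}
    for ref in action.get("object", []):
        entity = by_id[ref["@id"]]
        if entity.get("@type") == "PropertyValue":
            parameters[entity["name"]] = entity["value"]
        else:
            data_ids.append(entity["@id"])
    return data_ids, parameters
```

Here split_inputs(run, graph) separates the sequenced data that links this stage to the previous one from the parameter values recorded for the run.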
This stage should follow the style of the Workflow Run Crate profile, with the following exceptions:
- conformsTo and mainEntity
- @id of the workflow entity should ideally be a DOI, and the registry should support FAIR Signposting to retrieve the workflow as a RO-Crate from its DOI.
- FormalParameters can be ignored (though PropertyValues should still be used to capture parameter values where possible)

It is possible to make a RO-Crate that is fully compliant with both this profile and the Workflow Run Crate profile, in which case it can be declared as conforming to both profiles. This may be appropriate where the computational analysis is the primary focus of the crate (regardless of how much provenance is provided for the earlier stages).
Any computational workflow files included directly in the RO-Crate should follow the style of the Workflow RO-Crate profile, with the following exceptions:
mainEntity
Multiple analyses may be chained together.
It is common for multiple samples to feed into a single computational analysis, or for one analysis output to be the input of multiple secondary analyses. As such, it is permitted to have multiple “core” entities for each stage, each with their own provenance chain. Between one stage and the next, the core entities may be connected in one-to-one, many-to-one, one-to-many, or many-to-many relationships.