Federated Learning RO-Crate Profile
Version: 0.1
Permalink: N/A – use draft link https://esciencelab.org.uk/federated-learning-ro-crate-profile/federated-learning-profile.html
Authors:
- Eli Chadwick, https://orcid.org/0000-0002-0035-6475
- Ana T. Freitas, https://orcid.org/0000-0002-2997-5990
- Carles Hernandez, https://orcid.org/0000-0001-5393-3195
Example metadata file: JSON-LD, HTML preview
© 2025 The University of Manchester, UK
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Overview
This profile provides guidance on how to describe federated learning processes using RO-Crate. It is an extension of Process Run Crate where the input is decentralized data plus configuration files, and the output is a single model trained on that data with that configuration.
This profile is currently designed for “horizontal” data partitioning strategies, where each client site holds the same variables for a different cohort. It does not cover “vertical” strategies, where different clients hold different variables for the same cohort.
Aims
At minimum: Inform a recipient of the federated learning configuration that was used, so that they can reproduce it.
Ideal: Enable the federated learning process to be re-run automatically by providing a standard way to document configuration values.
Compatibility
This profile is based on RO-Crate 1.2 and aims to be compatible with other profiles used in trusted research environments and workflows, including Five Safes RO-Crate (0.4+) and the Workflow Run RO-Crate family.
Inheritance
This profile inherits all the requirements from Process Run Crate, a profile designed to capture the execution of one or more computational tools. This ensures consistency in the core metadata structure of the crate.
To summarise this profile as an extension of Process Run Crate: the CreateAction represents the learning process, with object referencing both the training datasets and the learning configuration, result referencing the output model, and instrument referencing the federated learning framework used (e.g. Flower).
Example Metadata Document (ro-crate-metadata.json)
Example metadata file: JSON-LD, HTML preview.
Input data
Each dataset used for training SHOULD be represented by a data entity in the crate. The data itself MAY be access controlled.
In data entities representing training datasets:
- @id SHOULD be a persistent identifier for the dataset
- license SHOULD be included. For public datasets this could be an open license; for restricted or sensitive datasets it can describe the conditions of access
- conformsTo MAY reference a common data model or phenotype dictionary that data in the dataset follows, e.g. OMOP mapping, GA4GH Phenopackets, Frictionless description
- subjectOf MAY reference a contextual or data entity describing a Data Management Plan for the dataset
- If Croissant metadata is available for the dataset, this MAY also be included
Each entity representing a training dataset MUST be referenced from object on the CreateAction which describes the training execution (see Federated Learning Process Execution).
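For illustration, a data entity for one client site's training dataset might look like the following sketch, where the identifier, license URL, and common data model link are hypothetical:

```json
{
    "@id": "https://doi.org/10.5281/zenodo.0000000",
    "@type": "Dataset",
    "name": "Site A training cohort",
    "license": {"@id": "https://example.org/data-access-conditions"},
    "conformsTo": {"@id": "https://ohdsi.github.io/CommonDataModel/"}
}
```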
Data partitioning strategy
The federated learning process described in the crate is assumed to use a “horizontal” data partitioning strategy, where each client site holds the same variables for a different cohort.
Future versions of this profile may also support “vertical” data partitioning, where different clients hold different variables for the same cohort.
Federated Learning Tools and Configuration
It is assumed that there is, at minimum, a tool or script that is distributed to clients and used to train the model on local data. Depending on the architecture or framework used there may be additional tools or scripts, for example to configure a centralized server or aggregator.
Training tool or workflow
The training could be orchestrated and run using a specific federated learning framework (e.g. Flower), a general software tool (e.g. Python), or a computational workflow (e.g. a Nextflow workflow), according to how the learning process is designed.
The relevant tool or workflow MUST be described using a contextual entity in the crate which includes the following properties:
- SHOULD have type SoftwareApplication, SoftwareSourceCode or ComputationalWorkflow (may also have other types)
- SHOULD include version with the version of the framework or application used
That entity MUST be referenced from instrument in the CreateAction describing the training execution (see Federated Learning Process Execution).
If a computational workflow is used, the crate MAY also include further metadata to conform to Workflow Run Crate.
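As a sketch, a contextual entity for a federated learning framework such as Flower could be described as follows; the @id and version shown are illustrative:

```json
{
    "@id": "https://pypi.org/project/flwr/",
    "@type": "SoftwareApplication",
    "name": "Flower",
    "version": "1.8.0"
}
```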
Training configuration – as files
Where the training is configured using configuration files or scripts, those files SHOULD be included in the crate and described using data entities.
Those entities SHOULD be referenced from object in the CreateAction describing the training execution (see Federated Learning Process Execution).
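For example, a server-side configuration file included in the crate might be described with a data entity like this (the filename is hypothetical):

```json
{
    "@id": "server_config.yaml",
    "@type": "File",
    "name": "Federated learning server configuration",
    "encodingFormat": "application/yaml"
}
```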
Training configuration – as environment variables
Configuration that is provided using environment variables SHOULD be described using PropertyValue entities, as described in Process Run Crate: Representing environment variable settings.
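Following that guidance, an environment variable setting the number of training rounds might be represented as the following sketch (the variable name and value are illustrative):

```json
{
    "@id": "#env-num-rounds",
    "@type": "PropertyValue",
    "name": "NUM_ROUNDS",
    "value": "50"
}
```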
Federated Learning Process Execution
It is assumed that the training process will usually be captured as a single CreateAction.
Execution of the training process
A CreateAction entity MUST be present which describes the execution of the training process using the following properties:
- instrument MUST reference the entity which describes the training tool or workflow
- object:
- SHOULD reference all the input datasets used for the training
- SHOULD reference any configuration files used by the instrument
- result:
- MUST reference the entity describing the output model
- MAY reference performance metrics of the model or training process (excluding resource usage)
- environment MAY reference environmental variables used for configuration
- resourceUsage MAY reference resource usage metrics for the training process
- Other properties (e.g. name, description, agent) SHOULD follow the guidelines set in Process Run Crate
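Putting these properties together, a minimal CreateAction for the training execution might look like the following sketch, in which all identifiers are hypothetical:

```json
{
    "@id": "#fl-training-run",
    "@type": "CreateAction",
    "name": "Federated training of the shared model",
    "instrument": {"@id": "https://pypi.org/project/flwr/"},
    "object": [
        {"@id": "https://doi.org/10.5281/zenodo.0000000"},
        {"@id": "server_config.yaml"}
    ],
    "result": [
        {"@id": "model.pt"},
        {"@id": "#metric-final-accuracy"}
    ],
    "environment": [
        {"@id": "#env-num-rounds"}
    ],
    "agent": {"@id": "https://orcid.org/0000-0000-0000-0000"}
}
```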
Pre-processing and post-processing
Additional CreateActions MAY be included in the crate to describe pre- and post-processing steps. See Process Run Crate: Multiple processes.
Note that if those pre- or post-processing steps are part of an automated workflow, they may be sufficiently described by using Workflow Run Crate or Provenance Run Crate.
Metrics - resource usage
Resource metrics – such as memory usage, execution time, estimated carbon cost, etc. – MAY be included in the crate. If included, they SHOULD follow the guidance in Provenance Run Crate: Representing resource usage.
Note 2026-02-26: this link does not yet work as the material is not yet merged into RDMkit
For guidance on best-practice metrics to collect for federated learning, see RDMkit: Federated Learning.
Metrics - model performance
Metrics that describe the performance of the training process and/or the trained model – such as drift detection metrics, loss/accuracy metrics, client-participation rate, etc. – MAY be included in the crate. If included, they SHOULD be described using PropertyValue entities, and those entities MUST be linked from result on the CreateAction (along with the model itself, see Output model).
A PropertyValue instance used to represent a performance metric MUST have a unique identifier representing the quantity being measured as its propertyID, and SHOULD refer to a unit of measurement via unitCode, except for dimensionless numbers.
This aligns with the guidance on resource usage metrics above, except that the metrics are connected through result rather than resourceUsage.
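For example, a final-accuracy metric (a dimensionless number, so no unitCode is given) could be sketched as follows; the identifier and propertyID URL are hypothetical:

```json
{
    "@id": "#metric-final-accuracy",
    "@type": "PropertyValue",
    "name": "final accuracy",
    "propertyID": "https://example.org/metrics#accuracy",
    "value": "0.93"
}
```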
Note 2026-02-26: this link does not yet work as the material is not yet merged into RDMkit
For guidance on best-practice metrics to collect for federated learning, see RDMkit: Federated Learning.
Output model
The crate MUST contain a data entity representing the output model. This could be a direct serialization of the model to file, or another representation of the model. The data entity:
- SHOULD have a persistent identifier as @id if such an identifier exists for the model
- SHOULD have license indicating usage conditions for the model; it is RECOMMENDED that an SPDX identifier is used. If no license is declared, the license from the Root Data Entity is assumed to apply to the model
- SHOULD declare encodingFormat and/or conformsTo with the format for the model. See RO-Crate: Adding detailed descriptions of File encodings and RO-Crate: File format profiles.
- MAY be a web-based data entity which MAY be access-controlled
The model MAY be further documented by one or more supplementary files, such as Model Cards or AI Model Passport. Where such files are represented as data entities within the crate:
- The model entity MUST reference them through subjectOf
- If the files were automatically generated during/at the end of the training process, the relevant CreateAction SHOULD reference them via result
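A sketch of an output model entity with a supplementary Model Card, assuming hypothetical filenames and an illustrative license:

```json
{
    "@id": "model.pt",
    "@type": "File",
    "name": "Trained global model",
    "encodingFormat": "application/octet-stream",
    "license": {"@id": "https://spdx.org/licenses/CC-BY-4.0"},
    "subjectOf": {"@id": "model_card.md"}
}
```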
Additional metadata
Sensitive Data
In processes where sensitive data is used, the Five Safes RO-Crate profile MAY additionally be followed.