RO-Crate for Federated Learning

Using RO-Crate to capture federated learning processes and models

Federated Learning RO-Crate Profile

Version: 0.1
Permalink: N/A – use draft link https://esciencelab.org.uk/federated-learning-ro-crate-profile/federated-learning-profile.html
Authors:

Example metadata file: JSON-LD, HTML preview

© 2025 The University of Manchester, UK

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Overview

This profile provides guidance on how to describe federated learning processes using RO-Crate. It is an extension of Process Run Crate where the input is decentralized data plus configuration files, and the output is a single model trained on that data with that configuration.

This profile is currently designed for “horizontal” data partitioning strategies, where each client site holds the same variables for a different cohort. It does not cover “vertical” strategies, where different clients hold different variables for the same cohort.

Aims

At minimum: Inform a recipient of the federated learning configuration that was used, so that they can reproduce it.

Ideal: Enable the federated learning process to be re-run automatically by providing a standard way to document configuration values.

Compatibility

This profile is based on RO-Crate 1.2 and aims to be compatible with other profiles used in trusted research environments and workflows, including Five Safes RO-Crate (0.4+) and the Workflow Run RO-Crate family.

Inheritance

This profile inherits all the requirements from Process Run Crate, a profile designed to capture the execution of one or more computational tools. This ensures consistency in the core metadata structure of the crate.

To summarise this profile as an extension of Process Run Crate: the CreateAction represents the learning process, with object referencing both the training datasets and the learning configuration, result referencing the output model, and instrument referencing the federated learning framework used (e.g. Flower).
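The mapping in this summary can be sketched as a short Python script that builds the training action. This is a minimal sketch, not a complete or validated crate: every identifier (`#train-run`, `data/site-a.csv`, `data/site-b.csv`, `config/strategy.toml`, `model/global.pt`, `#flower`) is hypothetical, and the root dataset, metadata descriptor and the referenced entities themselves are omitted.

```python
import json

# All identifiers below are hypothetical and chosen only for illustration.
training_run = {
    "@id": "#train-run",
    "@type": "CreateAction",
    "name": "Federated training run",
    # Inputs: the training datasets and the learning configuration.
    "object": [
        {"@id": "data/site-a.csv"},
        {"@id": "data/site-b.csv"},
        {"@id": "config/strategy.toml"},
    ],
    # Output: the trained model.
    "result": [{"@id": "model/global.pt"}],
    # The federated learning framework used to run the training.
    "instrument": {"@id": "#flower"},
}

print(json.dumps(training_run, indent=2))
```

In a real crate, each `@id` referenced here would also need its own entity in the `@graph` of ro-crate-metadata.json.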

Example Metadata Document (ro-crate-metadata.json)

Example metadata file: JSON-LD, HTML preview.

Input data

Each dataset used for training SHOULD be represented by a data entity in the crate. The data itself MAY be access controlled.

In data entities representing training datasets:

Each entity representing a training dataset MUST be referenced from object on the CreateAction which describes the training execution (see Federated Learning Process Execution).
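One way to check the requirement above is to compare the dataset identifiers against the references in object. The helper below is a minimal sketch; the function name, the `File` typing of the datasets and all identifiers are assumptions made for illustration.

```python
def unreferenced_datasets(graph, action_id, dataset_ids):
    """Return the dataset ids that the action's `object` does not reference."""
    action = next(e for e in graph if e.get("@id") == action_id)
    refs = action.get("object", [])
    if isinstance(refs, dict):  # JSON-LD allows a single reference or a list
        refs = [refs]
    referenced = {ref["@id"] for ref in refs}
    return [d for d in dataset_ids if d not in referenced]


# Hypothetical graph: site B is described but not referenced from the action.
graph = [
    {"@id": "data/site-a.csv", "@type": "File", "name": "Site A training data"},
    {"@id": "data/site-b.csv", "@type": "File", "name": "Site B training data"},
    {
        "@id": "#train-run",
        "@type": "CreateAction",
        "object": [{"@id": "data/site-a.csv"}],
    },
]

missing = unreferenced_datasets(
    graph, "#train-run", ["data/site-a.csv", "data/site-b.csv"]
)
print(missing)  # ['data/site-b.csv']
```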

Data partitioning strategy

The federated learning process described in the crate is assumed to use a “horizontal” data partitioning strategy, where each client site holds the same variables for a different cohort.

Future versions of this profile may also support “vertical” data partitioning, where different clients hold different variables for the same cohort.

Federated Learning Tools and Configuration

It is assumed that there is, at minimum, a tool or script that is distributed to clients and used to train the model on local data. Depending on the architecture or framework used, there may be additional tools or scripts, for example to configure a centralized server or aggregator.

Training tool or workflow

The training could be orchestrated and run using a specific federated learning framework (e.g. Flower), a general software tool (e.g. Python), or a computational workflow (e.g. a Nextflow workflow), according to how the learning process is designed.

The relevant tool or workflow MUST be described using a contextual entity in the crate which includes the following properties:

That entity MUST be referenced from instrument in the CreateAction describing the training execution (see Federated Learning Process Execution).

If a computational workflow is used, the crate MAY also include further metadata to conform to Workflow Run Crate.
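As an illustration of the instrument link, the sketch below describes a framework as a contextual entity and references it from the training action. The properties shown (`name`, `url`) and the `SoftwareApplication` type are assumptions borrowed from common Process Run Crate usage, not a statement of this profile's required properties. The identifiers are hypothetical.

```python
import json

# Assumed shape for the tool entity; the property choices are illustrative.
framework = {
    "@id": "#flower",
    "@type": "SoftwareApplication",
    "name": "Flower",
    "url": "https://flower.ai/",
}

# The training action points at the tool through `instrument`.
training_run = {
    "@id": "#train-run",
    "@type": "CreateAction",
    "instrument": {"@id": framework["@id"]},
}

graph = [framework, training_run]
print(json.dumps(graph, indent=2))
```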

Training configuration – as files

Where the training is configured using configuration files or scripts, those files SHOULD be included in the crate and described using data entities.

Those entities SHOULD be referenced from object in the CreateAction describing the training execution (see Federated Learning Process Execution).
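A minimal sketch of this pattern follows: each configuration file becomes a data entity and is appended to the action's object. The helper name, file paths and generated `name` values are hypothetical.

```python
def add_config_files(graph, action, paths):
    """Add one File entity per path and reference each from action['object']."""
    for path in paths:
        graph.append(
            {
                "@id": path,
                "@type": "File",
                "name": f"Training configuration file {path}",
            }
        )
        action.setdefault("object", []).append({"@id": path})


# Hypothetical action that already references one training dataset.
training_run = {
    "@id": "#train-run",
    "@type": "CreateAction",
    "object": [{"@id": "data/site-a.csv"}],
}
graph = [training_run]

add_config_files(graph, training_run, ["config/server.yaml", "config/client.yaml"])
print([ref["@id"] for ref in training_run["object"]])
```

Keeping datasets and configuration files together in object matches the Process Run Crate convention that object lists everything the action consumed.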

Training configuration – as environment variables

Configuration that is provided using environment variables should be described using PropertyValue entities, as in Process Run Crate: Representing environment variable settings.
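The sketch below turns a dictionary of environment variables into PropertyValue entities and links them from the action. It assumes the `environment` property used in Process Run Crate for this purpose. The variable names (`FL_NUM_ROUNDS`, `FL_MIN_CLIENTS`) and the `#env-` identifier scheme are hypothetical.

```python
def env_to_property_values(env):
    """Map environment variable names and values to PropertyValue entities."""
    return [
        {
            "@id": f"#env-{name.lower()}",
            "@type": "PropertyValue",
            "name": name,
            "value": value,
        }
        for name, value in env.items()
    ]


# Hypothetical settings; real variable names depend on the framework used.
settings = {"FL_NUM_ROUNDS": "10", "FL_MIN_CLIENTS": "3"}
entities = env_to_property_values(settings)

training_run = {
    "@id": "#train-run",
    "@type": "CreateAction",
    # Assumed linking property, following Process Run Crate.
    "environment": [{"@id": e["@id"]} for e in entities],
}
```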

Federated Learning Process Execution

It is assumed that the training process will usually be captured as a single CreateAction.

Execution of the training process

A CreateAction entity MUST be present which describes the execution of the training process using the following properties:

Pre-processing and post-processing

Additional CreateActions MAY be included in the crate to describe pre- and post-processing steps. See Process Run Crate: Multiple processes.

Note that if those pre- or post-processing steps are part of an automated workflow, they may be sufficiently described by using Workflow Run Crate or Provenance Run Crate.
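When separate actions are used, a pre-processing step can be connected to the training run by making its result one of the training action's object values. The sketch below shows that chaining and a small helper to find the shared entities. All identifiers and the helper name are hypothetical.

```python
# Hypothetical pre-processing action: raw site data in, cleaned data out.
preprocess = {
    "@id": "#preprocess-site-a",
    "@type": "CreateAction",
    "name": "Normalise site A data",
    "object": [{"@id": "raw/site-a.csv"}],
    "result": [{"@id": "data/site-a.csv"}],
}

# The training action consumes the cleaned data.
training_run = {
    "@id": "#train-run",
    "@type": "CreateAction",
    "object": [{"@id": "data/site-a.csv"}],
    "result": [{"@id": "model/global.pt"}],
}


def feeds_into(upstream, downstream):
    """Return ids produced by `upstream` and consumed by `downstream`."""
    produced = {ref["@id"] for ref in upstream.get("result", [])}
    consumed = {ref["@id"] for ref in downstream.get("object", [])}
    return sorted(produced & consumed)


print(feeds_into(preprocess, training_run))  # ['data/site-a.csv']
```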

Metrics – resource usage

Resource metrics – such as memory usage, execution time, estimated carbon cost, etc. – MAY be included in the crate. If they are, they SHOULD follow the guidance in Provenance Run Crate: Representing resource usage.
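As a rough illustration, a resource metric can be written as a PropertyValue linked from the action through resourceUsage. In this sketch the `propertyID` URL is a hypothetical placeholder, the value is invented for the example, and the QUDT unit IRI for seconds is assumed to be a suitable `unitCode`.

```python
# Hypothetical execution-time metric for the training run.
execution_time = {
    "@id": "#train-run-execution-time",
    "@type": "PropertyValue",
    "name": "Execution time",
    "propertyID": "https://example.org/metrics#execution-time",  # placeholder
    "unitCode": "https://qudt.org/vocab/unit/SEC",  # assumed unit: seconds
    "value": "5400",
}

training_run = {
    "@id": "#train-run",
    "@type": "CreateAction",
    "resourceUsage": [{"@id": execution_time["@id"]}],
}
```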

Note (2026-02-26): the link below does not yet work, as the material has not yet been merged into RDMkit.
For guidance on best-practice metrics to collect for federated learning, see RDMkit: Federated Learning.

Metrics – model performance

Metrics that describe the performance of the training process and/or the trained model – such as drift detection metrics, loss/accuracy metrics, client-participation rate, etc. – MAY be included in the crate. If included, they SHOULD be described using PropertyValue entities, and those entities MUST be linked from result on the CreateAction (along with the model itself, see Output model).

A PropertyValue instance used to represent a performance metric MUST have a propertyID that uniquely identifies the quantity being measured, and SHOULD refer to a unit of measurement via unitCode, unless the value is a dimensionless number.

This aligns with the guidance on resource usage metrics above, except that the metrics are connected through result rather than resourceUsage.
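The sketch below builds two performance metrics and links them, together with the model, from result. The helper name, identifiers, metric values and `propertyID` URLs are hypothetical. The use of the UN/CEFACT common code `P1` (percent) as a `unitCode` is an assumption about how a participation rate might be expressed.

```python
def metric_entity(entity_id, name, property_id, value, unit_code=None):
    """Build a PropertyValue; unitCode is omitted for dimensionless values."""
    entity = {
        "@id": entity_id,
        "@type": "PropertyValue",
        "name": name,
        "propertyID": property_id,
        "value": value,
    }
    if unit_code is not None:
        entity["unitCode"] = unit_code
    return entity


# Dimensionless metric: no unitCode.
accuracy = metric_entity(
    "#metric-accuracy",
    "Accuracy",
    "https://example.org/metrics#accuracy",  # placeholder identifier
    0.91,
)
# Metric with a unit, expressed here in percent.
participation = metric_entity(
    "#metric-participation",
    "Client participation rate",
    "https://example.org/metrics#client-participation-rate",  # placeholder
    80,
    unit_code="P1",
)

training_run = {
    "@id": "#train-run",
    "@type": "CreateAction",
    # The model and its metrics are all linked from `result`.
    "result": [{"@id": "model/global.pt"}]
    + [{"@id": m["@id"]} for m in (accuracy, participation)],
}
```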

Note (2026-02-26): the link below does not yet work, as the material has not yet been merged into RDMkit.
For guidance on best-practice metrics to collect for federated learning, see RDMkit: Federated Learning.

Output model

The crate MUST contain a data entity representing the output model. This could be a direct serialization of the model to file, or another representation of the model. The data entity:

The model MAY be further documented by one or more supplementary files, such as Model Cards or AI Model Passport. Where such files are represented as data entities within the crate:

Additional metadata

Sensitive Data

In processes where sensitive data is used, the Five Safes RO-Crate profile MAY additionally be followed.