FAIR Data Submission Guidance for EDRN
(Draft)
Version: 1.0.1
Date: 2025-1-1
To align with the FAIR principles outlined in the EDRN Data Sharing Policy, the EDRN has developed a set of minimal requirements for submitting data to LabCAS, EDRN's biomarker data commons repository.
A primary goal of EDRN data collection is to ensure the reusability of the data by groups beyond those who originally collected it. Per the EDRN Data Sharing Policy, data must be made available for public use. This includes providing sufficient metadata and documentation to help users understand the data's configuration and structure. Below is guidance on the critical metadata that should accompany your data submission. Additionally, supplemental documents and readme files can be included to support the enhanced use of the data.
Core Metadata to Support Definition, Accessibility, and Structure of the Data
Metadata is critical to support the discoverability, interpretability, and usability of the data. LabCAS organizes data into Collections, Datasets, and Files, each with its own set of minimal metadata requirements. Additional metadata is also defined for various assay types. Those metadata are coordinated as Common Data Elements by research groups and should be added to increase the usability of the data.
Metadata for Collections, Datasets, and Files
This Metadata Check List details the required and optional metadata for Collections, Datasets, and Files for LabCAS data submission. For more comprehensive information, please refer to the EDRN Data Model.
De-Identification of Data
All data must be de-identified at your site before uploading to LabCAS. Additional guidance may be available in your study's SOP.
For Reference Sets (LTP2, PMRI, and BRSI), follow the Imaging Data Transfer SOP provided by the DMCC.
As described in the Metadata Check List, the De-identification Method (Safe Harbor or Expert Review) is required metadata for data submission. Please refer to Health and Human Services - Methods for De-identification of PHI for more information.
Data to Upload
This section details what's required and optional when uploading data.
ReadMe File, Ancillary Data, Data Dictionaries and Other Information
Methodology details should be included as part of the metadata. You can also include supplemental information explaining the algorithms and computations applied to the raw data. Additional data, such as ancillary data or clinical records, may also be uploaded to LabCAS as supporting documentation. Examples of supporting files include:
- ReadMe file: A text file explaining your data for easier understanding and reuse.
- Standard Operating Procedures (SOPs): Any SOPs followed during your study.
- Ancillary Data: Clinical or other data captured during the study.
- Data Dictionaries: Definitions of any ancillary data.
Data Files
To support reproducibility and facilitate robust analyses by external researchers, each data submission should include the applicable core data components. Please note that summary or aggregated data alone is insufficient; raw data and relevant clinical data are essential for meaningful reanalysis.
- Raw Data Files
Raw data files are the unprocessed outputs directly obtained from the instruments or data collection processes. Including raw data ensures that downstream analyses can be reproduced and verified. The raw data should be:
- Unaltered and Complete: All data generated by the instruments or data collection processes should be included without modifications, filtering, or aggregation.
- In Original Format: Whenever possible, submit raw data in the original file format generated by the instrument (e.g., .fastq for sequencing data, .dcm for imaging data, .csv for sensor data). This helps maintain fidelity and compatibility for future reprocessing.
- Accompanied by Essential Parameters: Ensure any instrument parameters or settings (e.g., calibration details, machine specifications) are documented, either within the metadata or as a separate parameter file.
- Clinical Data Files
Clinical data files should include detailed participant information relevant to the study, ensuring that relationships between clinical characteristics and outcomes can be examined. The clinical data should contain:
- Individual-Level Data: Each row should represent a single participant or specimen, with columns for each variable collected (e.g., age, gender, diagnosis, treatment details).
- De-identified and Anonymized: Ensure compliance with ethical guidelines by de-identifying personal information while preserving the clinical details necessary for analysis.
- Longitudinal or Follow-Up Data (if applicable): For studies with multiple time points, include data from each time point to enable temporal analysis of clinical outcomes.
File Inclusion Guidelines
When uploading files, include only those directly relevant to the study. For example, if you are uploading DICOM files from a CD-ROM, ensure only the DICOM files needed for analysis are included. Do not upload extraneous files such as software, image viewers, or other auxiliary content that may be included on the CD-ROM. Uploading unrelated files could create licensing issues or unnecessary data clutter.
Optional Files
You may optionally include a file with checksums to verify data integrity. This file can be in .csv format, where:
- The first column contains the path to each uploaded file.
- The second column contains the MD5, SHA-256, or another hash/checksum for the corresponding file.
When to Include Checksums:
- Recommended: When submitting large files or datasets to ensure files are complete and intact after upload.
- Optional: For smaller submissions or when checksum verification is not critical to your workflow.
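As an illustration of the two-column checksum file described above, the following is a minimal sketch that walks a collection folder and writes one row per file. The function names, the choice of SHA-256, and the use of paths relative to the collection's parent folder are assumptions made for this example, not LabCAS requirements.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_manifest(collection_dir: Path, manifest_path: Path) -> int:
    """Write one (path, checksum) row per file; return the number of rows.

    Hypothetical helper: column 1 is the file path (starting at the
    collection folder name), column 2 is the SHA-256 digest.
    """
    rows = 0
    with manifest_path.open("w", newline="") as out:
        writer = csv.writer(out)
        files = sorted(p for p in collection_dir.rglob("*") if p.is_file())
        for file_path in files:
            # Skip the manifest itself if it lives inside the collection.
            if file_path.resolve() == manifest_path.resolve():
                continue
            relative = file_path.relative_to(collection_dir.parent).as_posix()
            writer.writerow([relative, sha256_of(file_path)])
            rows += 1
    return rows
```

For example, calling write_checksum_manifest(Path("mycollection"), Path("checksums.csv")) would produce a file whose rows look like mycollection/12345/10000.dcm followed by that file's digest.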
Organization of Files and Folders
When uploading data, you must arrange the filesystem hierarchy (the folders and files that contain your data) according to the following structure:
- There must be a single top-level folder named after the collection of the data you are submitting.
- Within this folder, you must provide one or more sub-folders, one for each dataset within the collection. This should match the logical organization of your data. The dataset name is typically a de-identified participant ID, an event ID (for reference sets), or a site name. Any ancillary files can be placed in a sub-folder labeled CollectionLevelFiles, and they will be included at the collection level.
- Datasets may themselves contain nested datasets, which are represented by nested dataset folders named for each nested dataset.
- Within the dataset folders, provide the files (DICOM images, EDF files, MINC files, OME-TIFF, CZI, etc.) that contain the actual data.
An example structure is shown below:
collection/
    CollectionLevel/
        ReadMe.txt
        SOP 1.pdf
        ClinicalData.csv
        DataDictionary.csv
    dataset 1/
        participant 1/
            (optional nested datasets)/
            file 1
            file 2
            file 3
            . . .
        participant 2/
            (optional nested datasets)/
            file 1
            file 2
            file 3
            . . .
    dataset 2/
        participant 1/
            (optional nested datasets)/
            file 1
            file 2
            file 3
            . . .
        file n
        file n+1
        file n+2
        . . .
As mentioned before, do not include viewer software, AUTORUN.INF files, .exe files, .app folders, .DLL files, LICENSE files, Java files, etc.
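A quick pre-upload scan can help catch such files. The sketch below is only illustrative: the suffix and name lists are assumptions derived from the examples in this guidance (for instance, .jar and .class are used here to stand in for "Java files"), and they are not an official LabCAS exclusion list.

```python
from pathlib import Path

# Illustrative patterns only; adjust to your own data source.
BLOCKED_SUFFIXES = {".exe", ".dll", ".jar", ".class"}
BLOCKED_NAMES = {"autorun.inf", "license", "license.txt"}


def find_extraneous(collection_dir: Path) -> list[Path]:
    """Return files and .app folders that likely should not be uploaded."""
    flagged = []
    for path in sorted(collection_dir.rglob("*")):
        name = path.name.lower()
        if path.is_dir():
            # macOS application bundles are folders ending in .app.
            if name.endswith(".app"):
                flagged.append(path)
            continue
        if path.suffix.lower() in BLOCKED_SUFFIXES or name in BLOCKED_NAMES:
            flagged.append(path)
    return flagged
```

Matching is case-insensitive, so AUTORUN.INF and Viewer.EXE are both reported. The function only lists candidates; deciding whether to delete them is left to the submitter.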
Your data must also be self-contained; this means that each file contains all of the data needed to describe itself (aside from the metadata you submit separately). Examples:
- mycollection/12345/10000.dcm is valid; it is a single data file (10000.dcm) for event ID dataset 12345 that belongs to "mycollection".
- mycollection/torso.dcm plus mycollection/explanation.xlsx, where the Excel spreadsheet gives the event ID for "torso.dcm", violates the accepted upload format.
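Whether a file is truly self-describing cannot be checked automatically, but the structural symptom in the second example (files sitting directly in the top-level folder rather than in a dataset folder) can be. The following is a minimal sketch under that assumption; check_layout is a hypothetical helper, and it treats CollectionLevelFiles as the only non-dataset sub-folder, following the name given in the folder rules above.

```python
from pathlib import Path

# Sub-folder name for ancillary files, as given in this guidance.
COLLECTION_LEVEL_DIR = "CollectionLevelFiles"


def check_layout(collection_dir: Path) -> list[str]:
    """Return human-readable layout problems (an empty list if none found)."""
    problems = []
    entries = sorted(collection_dir.iterdir())

    # Files directly under the collection folder belong to no dataset.
    for entry in entries:
        if entry.is_file():
            problems.append(f"file outside any dataset folder: {entry.name}")

    datasets = [
        e for e in entries if e.is_dir() and e.name != COLLECTION_LEVEL_DIR
    ]
    if not datasets:
        problems.append("no dataset sub-folders found")

    # Each dataset folder should hold at least one file, possibly nested.
    for dataset in datasets:
        if not any(p.is_file() for p in dataset.rglob("*")):
            problems.append(f"dataset folder contains no files: {dataset.name}")
    return problems
```

With the examples above, a folder containing only mycollection/12345/10000.dcm yields no problems, while mycollection/torso.dcm and mycollection/explanation.xlsx are each reported as files outside any dataset folder.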
You may package your filesystem hierarchy into an archive. We can accept .zip, .tar, .tar.gz, .tar.bz2, and .tar.xz files.
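One way to build such an archive is with Python's standard tarfile module. This is a minimal sketch; pack_collection is a hypothetical helper name, and .tar.gz is chosen here only as one of the accepted formats.

```python
import tarfile
from pathlib import Path


def pack_collection(collection_dir: Path, archive_path: Path) -> None:
    """Create a .tar.gz archive whose single top-level entry is the collection folder."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps the collection folder name as the archive root
        # and drops any leading directories from the local path.
        archive.add(collection_dir, arcname=collection_dir.name)
```

Write the archive to a location outside the collection folder; otherwise the partially written archive would be picked up as one of the files being packed.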
Review and Verification
Validation of required metadata will be performed during the upload of data to LabCAS. In addition, each site and/or assigned domain expert must review and validate their data to ensure accuracy in data capture and usability by others in LabCAS. This is a critical step in ensuring that the data can be shared and used by other research groups.