AI Readiness Guidelines
(Draft)
Version: 1.0.0
Date: 2024-10-18
We will work with EDRN sites submitting data to LabCAS to ensure, as far as possible, that the data meets these AI-readiness guidelines.
Making data AI-ready means preparing and optimizing it so that artificial intelligence (AI) and machine learning (ML) models can use it effectively. This involves a combination of technical and structural considerations to ensure that the data is clean, organized, and structured in a way that supports AI workflows.
When submitting your data, consider these key factors:
1. Quality and Cleanliness
- Consistency: Data should be free from errors, inconsistencies, and duplicates.
- Completeness: Missing values should be addressed or imputed, ensuring that the dataset is as complete as possible.
- Accuracy: Data should accurately represent the real-world entities it describes, with minimal errors.
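As a minimal sketch of the cleaning steps above, the toy example below drops exact duplicate records and fills a missing numeric value with the column median. The records, field names, and imputation strategy are illustrative assumptions, not a required recipe.

```python
from statistics import median

# Hypothetical toy dataset: each record is a dict; None marks a missing value.
records = [
    {"id": 1, "psa": 4.1},
    {"id": 2, "psa": None},   # missing value to impute
    {"id": 3, "psa": 6.3},
    {"id": 3, "psa": 6.3},    # exact duplicate to drop
]

# Drop exact duplicates while preserving order.
seen, deduped = set(), []
for rec in records:
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(rec)

# Impute missing numeric values with the column median.
observed = [r["psa"] for r in deduped if r["psa"] is not None]
fill = median(observed)
for rec in deduped:
    if rec["psa"] is None:
        rec["psa"] = fill
```

Median imputation is just one option; the right strategy (mean, model-based, or flagging rather than filling) depends on the dataset and should be documented.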
2. Structured and Standardized
- Format: Data should be in a format that AI models can easily process (e.g., structured tables, numerical arrays, or properly annotated images/text).
- Standardization: Data should follow consistent standards, such as common taxonomies, consistent units of measure, and standardized formats (e.g., ISO date formats).
- Normalization: Data should be scaled and normalized where appropriate to ensure that numerical values are on a similar scale for model input.
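The standardization and normalization points above can be sketched in a few lines: converting mixed date strings to ISO 8601, and min-max scaling numeric values onto a common 0-1 range. The input formats and values are assumed for illustration.

```python
from datetime import datetime

# Hypothetical mixed-format dates normalized to ISO 8601 (YYYY-MM-DD).
raw_dates = ["10/18/2024", "2024-10-19", "18 Oct 2024"]
formats = ["%m/%d/%Y", "%Y-%m-%d", "%d %b %Y"]

def to_iso(value):
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

iso_dates = [to_iso(d) for d in raw_dates]

# Min-max normalization so numeric features share a 0-1 scale.
values = [12.0, 48.0, 30.0]
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
```

Min-max scaling is one common choice; z-score standardization is another, and which is appropriate depends on the model being trained.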
3. Sufficient Volume and Diversity
- Quantity: AI models, especially deep learning models, often require large amounts of data to perform well. Having sufficient volume allows for better model training and generalization.
- Diversity: The dataset should cover a wide variety of cases or examples to ensure that the model can generalize well and is not biased toward specific subsets of data.
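One quick diversity check is to inspect the class distribution before submission. The labels and the 20% threshold below are illustrative assumptions; the point is simply to surface under-represented groups early.

```python
from collections import Counter

# Hypothetical outcome labels for a dataset; check class balance before training.
labels = ["benign"] * 90 + ["malignant"] * 10

counts = Counter(labels)
total = sum(counts.values())
proportions = {cls: n / total for cls, n in counts.items()}

# Flag any class making up less than 20% of the data as under-represented.
under_represented = [cls for cls, p in proportions.items() if p < 0.20]
```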
4. Labeled and Annotated Data
- Labels: In the case of supervised learning, the data should be properly labeled with relevant tags, categories, or outcomes.
- Annotations: For images, text, or other unstructured data types, annotations (e.g., bounding boxes, text tags, or other indicators) should be clear and accurate.
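A label or annotation record can be as simple as a small JSON object with a few required fields. The sketch below uses a COCO-style bounding box given as [x, y, width, height]; the field names are illustrative only, not a LabCAS schema, and the validation is a minimal sanity check.

```python
import json

# Hypothetical annotation record for one image; field names are illustrative.
annotation = {
    "image": "slide_0042.png",
    "label": "lesion",
    "bbox": [34, 120, 56, 48],  # [x, y, width, height]
}

def validate(ann):
    """Minimal sanity checks: required keys present, bounding box well-formed."""
    required = {"image", "label", "bbox"}
    missing = required - ann.keys()
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    x, y, w, h = ann["bbox"]
    if w <= 0 or h <= 0:
        raise ValueError("Bounding box must have positive width and height")
    return True

validate(annotation)
serialized = json.dumps(annotation)
```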
5. Interoperability and Accessibility
- Interoperable Formats: Data should be stored in formats that are easily accessible and compatible with AI tools and libraries (e.g., CSV, JSON, HDF5, or Parquet for structured data).
- APIs and Data Pipelines: Data should be easily accessible through APIs or pipelines that allow for automated or real-time data ingestion into AI models.
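To illustrate the interoperable-formats point, the sketch below writes the same hypothetical records as both CSV and JSON using only the standard library; in-memory buffers stand in for files so the example is self-contained.

```python
import csv
import io
import json

# Hypothetical records exported to two interoperable formats (CSV and JSON).
records = [
    {"sample_id": "S001", "biomarker": "CA-125", "value": 35.2},
    {"sample_id": "S002", "biomarker": "CA-125", "value": 18.7},
]

# CSV: a header row followed by one row per record.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["sample_id", "biomarker", "value"])
writer.writeheader()
writer.writerows(records)

# JSON: the same records as an array of objects.
json_text = json.dumps(records, indent=2)
```

For large tabular or array data, binary formats such as HDF5 or Parquet (via libraries like h5py or pyarrow) are usually preferable to text formats.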
6. Ethical and Compliant (Governance)
- Bias Mitigation: Ensure the data is free from harmful biases that could lead to unfair AI model outcomes.
- Legal Compliance: The data should comply with regulations, such as GDPR (General Data Protection Regulation), HIPAA (Health Insurance Portability and Accountability Act), or other relevant privacy laws.
7. Metadata and Documentation
- Metadata: Rich metadata should be included to describe the data’s origin, structure, and intended use. This helps AI developers understand and use the data effectively.
- Data Provenance: Documenting the source, collection method, and any transformations applied to the data helps maintain data integrity and traceability.
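The metadata and provenance points above can be captured in a small JSON "sidecar" that travels with the dataset. All field names below are illustrative assumptions, not a LabCAS metadata schema; the idea is simply to record origin, processing history, and intended use in a machine-readable form.

```python
import json
from datetime import date

# Hypothetical metadata sidecar; field names are illustrative, not a LabCAS schema.
metadata = {
    "dataset": "example_biomarker_panel",
    "created": date(2024, 10, 18).isoformat(),
    "source": "Site X clinical lab export",
    "provenance": [
        {"step": "collected", "method": "ELISA assay"},
        {"step": "transformed", "method": "deduplicated and median-imputed"},
    ],
    "intended_use": "supervised model training",
}

sidecar = json.dumps(metadata, indent=2)
```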