What does Article 10 of the EU AI Act require for training data?

Article 10 requires that training, validation, and testing data for high-risk AI systems be relevant, sufficiently representative of the population the system is intended to serve, and, to the best extent possible, free of errors and complete in view of the intended purpose. Data governance practices must also examine and address potential biases.

What is data leakage in AI training?

Data leakage occurs when the training data contains information that would not be available at prediction time. This causes artificially high accuracy during training that collapses in production, because the model learned to rely on information it will never have when making real predictions.

How do GDPR and the EU AI Act intersect on training data?

The EU AI Act's data governance requirements complement GDPR obligations. Including unnecessary personal data in training datasets violates GDPR's data minimization principle and creates additional compliance risk on top of AI Act obligations. Only data directly relevant to the model's task should be included.
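As a minimal sketch of data minimization, one could keep only the fields that have been explicitly justified for the model's task. The field names and the allowlist below are hypothetical and are not prescribed by GDPR or the AI Act.

```python
# Hypothetical allowlist: fields a reviewer has justified as task-relevant.
ALLOWED_FIELDS = {"tenure_months", "monthly_spend", "support_tickets", "churned"}


def minimize(record: dict) -> dict:
    """Return a copy of the record containing only allowlisted fields."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}


def dropped_fields(record: dict) -> set:
    """Report which fields were excluded, so the decision can be documented."""
    return set(record) - ALLOWED_FIELDS


# Illustrative record; every value here is made up.
raw = {
    "full_name": "Example Person",
    "email": "person@example.com",
    "tenure_months": 18,
    "monthly_spend": 42.5,
    "support_tickets": 3,
    "churned": False,
}

clean = minimize(raw)
removed = dropped_fields(raw)
```

An allowlist is used here rather than a blocklist so that a newly added column is excluded by default until someone justifies it.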

AI Data Governance

Review training data quality against EU AI Act Article 10 requirements.

What Is AI Data Governance?

Learn how Article 10 of the EU AI Act establishes data governance requirements for high-risk AI systems. Review a training dataset for representativeness, data quality, leakage, and unnecessary personal data before model training can proceed.

What You'll Learn in AI Data Governance

Understand Article 10 requirements for training data quality in high-risk AI systems
Identify data representativeness issues that could lead to biased AI predictions
Recognize data leakage that artificially inflates model performance
Apply data minimization principles to reduce GDPR compliance risk in AI training data
Document data governance findings as required for high-risk AI compliance

AI Data Governance — Training Steps

Article 10: Data Governance

Article 10 of the EU AI Act establishes data governance requirements for high-risk AI systems. Training, validation, and testing data must meet strict quality criteria:

Data must be relevant to the task the AI system is designed to perform.
Data must be sufficiently representative of the population the model will serve.
Data must be, to the best extent possible, free of errors and complete in view of the intended purpose.
Data governance practices must address potential biases that could lead to discriminatory outcomes.

Poor data leads to biased AI, and biased AI leads to legal liability. Data governance is not just a best practice under the EU AI Act - it is a legal obligation.
Dataset Review Request

An email arrives from Marcus Rodriguez, the AI Team Lead. The team is preparing to train ChurnPredict v3, and the dataset needs a compliance review before training can begin. The email links directly to the dataset on the DataOps platform.
Issue 1: Regional Underrepresentation

The DataOps platform loads the ChurnPredict v3 dataset review. The regional distribution of the training data immediately stands out - the dataset is heavily concentrated in one region even though the model is designed to serve all four regions equally.
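A representativeness check of this kind can be sketched as a comparison of each region's share against an equal-share target. The 72% North share comes from the review summary. The other region names, the split of the remaining records, and the 0.5 tolerance are assumptions made for illustration.

```python
from collections import Counter


def region_shares(regions):
    """Return each region's share of the records as a fraction of the total."""
    counts = Counter(regions)
    total = sum(counts.values())
    return {region: count / total for region, count in counts.items()}


def underrepresented(shares, expected_regions, tolerance=0.5):
    """Flag regions whose share is below tolerance * the equal-share target.

    The equal-share target and the default tolerance are assumptions of this
    sketch; a real review would derive targets from the deployment population.
    """
    target = 1 / len(expected_regions)
    return sorted(
        region
        for region in expected_regions
        if shares.get(region, 0.0) < tolerance * target
    )


# Hypothetical sample: 72% North as in the review, remaining split invented.
sample = ["North"] * 72 + ["South"] * 12 + ["East"] * 10 + ["West"] * 6
shares = region_shares(sample)
flagged = underrepresented(shares, ["North", "South", "East", "West"])
```

With four regions the equal-share target is 25%, so under the assumed tolerance any region below 12.5% of the records is flagged.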
Issue 2: Stale Pre-Pandemic Data

The data collection timeline reveals another concern. A significant portion of the records predate a fundamental shift in customer behavior.
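One way to quantify the problem is to measure what fraction of records falls before a cutoff date and to separate those records for exclusion or down-weighting. This is a sketch only: the `collected_on` field name, the sample records, and the exact cutoff are assumptions, chosen so that records from 2019-2020 count as stale.

```python
from datetime import date

# Assumed cutoff for this sketch: anything collected before 2021 is "stale".
CUTOFF = date(2021, 1, 1)


def split_by_recency(records, cutoff=CUTOFF):
    """Split records into (recent, stale) based on their collection date."""
    recent = [r for r in records if r["collected_on"] >= cutoff]
    stale = [r for r in records if r["collected_on"] < cutoff]
    return recent, stale


def stale_share(records, cutoff=CUTOFF):
    """Fraction of records collected before the cutoff (0.0 for empty input)."""
    if not records:
        return 0.0
    _, stale = split_by_recency(records, cutoff)
    return len(stale) / len(records)


# Hypothetical records; "collected_on" is an assumed field name.
records = [
    {"customer_id": 1, "collected_on": date(2019, 6, 1)},
    {"customer_id": 2, "collected_on": date(2020, 11, 15)},
    {"customer_id": 3, "collected_on": date(2022, 3, 9)},
    {"customer_id": 4, "collected_on": date(2023, 8, 21)},
]

recent, stale = split_by_recency(records)
share = stale_share(records)
```

Whether stale records are dropped or merely weighted down is a modeling decision; the split keeps both options open.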
Issue 3: Data Leakage

A closer look at the feature list reveals a critical data quality problem that would undermine the model entirely.
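A simple screening heuristic for this kind of leakage is to ask how well a single feature, on its own, determines the target. The sketch below is one such heuristic, not a complete leakage audit. The review summary names `account_status` as the leaking feature; the `plan` feature, the category values, and the sample rows are invented for illustration.

```python
from collections import defaultdict


def determinism_score(rows, feature, target):
    """Fraction of rows whose target matches the majority target for their
    feature value.

    A score of 1.0 means the feature value alone fully determines the target
    in this sample, which is a strong hint of leakage worth investigating.
    """
    by_value = defaultdict(lambda: defaultdict(int))
    for row in rows:
        by_value[row[feature]][row[target]] += 1
    majority_hits = sum(max(counts.values()) for counts in by_value.values())
    return majority_hits / len(rows)


# Hypothetical rows: "closed" accounts are exactly the churned customers.
rows = [
    {"account_status": "closed", "plan": "basic", "churned": True},
    {"account_status": "closed", "plan": "pro", "churned": True},
    {"account_status": "active", "plan": "basic", "churned": False},
    {"account_status": "active", "plan": "pro", "churned": False},
]

status_score = determinism_score(rows, "account_status", "churned")
plan_score = determinism_score(rows, "plan", "churned")
```

In this toy sample `account_status` scores 1.0 while `plan` scores 0.5. High-cardinality columns such as customer IDs also score 1.0 under this heuristic, so a high score is a prompt to inspect the feature, not proof of leakage.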
Knowledge Check: Data Representativeness

Before continuing the review, a question about the regional distribution issue.
Issue 4: Unnecessary Personal Data

The final section of the review reveals a compliance risk that extends beyond the AI Act into GDPR territory.
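Where a stable customer key is still needed after the direct identifiers are removed, one common option is keyed hashing. The sketch below uses HMAC-SHA256 from the Python standard library. The field names, the choice of email as the stable identifier, and the demo key are assumptions. Pseudonymized data generally remains personal data under GDPR, so this reduces exposure without removing the obligations.

```python
import hashlib
import hmac

# Direct identifiers listed in the review summary; the field names are assumed.
PII_FIELDS = ("name", "email", "phone", "address")


def pseudonymize(record, secret_key, id_field="email"):
    """Drop direct identifiers and add a keyed hash of one stable identifier."""
    token = hmac.new(
        secret_key, record[id_field].lower().encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    cleaned["customer_token"] = token
    return cleaned


# Hypothetical record; in practice the key would live in a secrets manager.
record = {
    "name": "Example Person",
    "email": "Person@example.com",
    "phone": "000-0000",
    "address": "1 Example Street",
    "tenure_months": 18,
    "churned": False,
}

safe_record = pseudonymize(record, b"demo-key-not-for-production")
```

A keyed hash is used rather than a plain hash so that someone without the key cannot rebuild the mapping by hashing a list of known email addresses.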
Review Summary

Alice has completed the data governance review. Four critical issues must be resolved before model training can proceed:

Severe regional underrepresentation - 72% North region data for a model serving 4 regions equally. The dataset must be rebalanced to adequately represent all deployment regions.
Stale pre-pandemic data - 38% of records from 2019-2020 no longer reflect current customer behavior. These records should be excluded or weighted appropriately.
Data leakage - the account_status feature directly encodes the target variable and must be removed to prevent artificially inflated training accuracy.
Unnecessary PII - raw names, emails, phone numbers, and addresses create GDPR exposure without contributing to churn prediction. These fields must be removed or pseudonymized.
File a Compliance Report

Identifying gaps is only half the job. The data governance review must also be documented, and in this scenario the report is routed to the AI Team Lead and the Data Protection Officer so that model training stays paused until the issues are resolved.
Submit the Compliance Report

Alice fills in the report with the findings, the four gaps mapped to Article 10 and GDPR, and the actions the AI team must complete before training resumes.
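The shape of such a report could be sketched as a small data structure that serializes to JSON. The class names, field names, reviewer value, and the per-finding mapping to Article 10 or GDPR are assumptions for illustration; the scenario does not specify the report's actual format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    issue: str
    legal_basis: str  # regime the reviewer mapped the gap to (assumed mapping)
    required_action: str


@dataclass
class ComplianceReport:
    dataset: str
    reviewer: str
    findings: list = field(default_factory=list)  # list of Finding objects
    training_blocked: bool = True

    def to_json(self) -> str:
        """Serialize the report, including nested findings, to JSON."""
        return json.dumps(asdict(self), indent=2)


report = ComplianceReport(
    dataset="ChurnPredict v3",
    reviewer="Alice",
    findings=[
        Finding("Regional underrepresentation", "EU AI Act Article 10",
                "Rebalance data across all deployment regions"),
        Finding("Stale pre-pandemic data", "EU AI Act Article 10",
                "Exclude or down-weight 2019-2020 records"),
        Finding("Data leakage via account_status", "EU AI Act Article 10",
                "Remove the account_status feature"),
        Finding("Unnecessary PII", "GDPR data minimization",
                "Remove or pseudonymize names, emails, phones, addresses"),
    ],
)

report_json = report.to_json()
```

Keeping `training_blocked` as an explicit field makes the pause on training a recorded decision rather than an implied one.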
