AI Data Governance
Review training data quality against EU AI Act Article 10 requirements.
What Is AI Data Governance?
Learn how Article 10 of the EU AI Act establishes data governance requirements for high-risk AI systems. Review a training dataset for representativeness, data quality, leakage, and unnecessary personal data before model training can proceed.
What You'll Learn in AI Data Governance
- Understand Article 10 requirements for training data quality in high-risk AI systems
- Identify data representativeness issues that could lead to biased AI predictions
- Recognize data leakage that artificially inflates model performance
- Apply data minimization principles to reduce GDPR compliance risk in AI training data
- Document data governance findings as required for high-risk AI compliance
AI Data Governance — Training Steps
-
Article 10: Data Governance
Article 10 of the EU AI Act establishes data governance requirements for high-risk AI systems. Training, validation, and testing data must meet strict quality criteria:
- Data must be relevant to the task the AI system is designed to perform.
- Data must be sufficiently representative of the population the model will serve.
- Data must be, to the best extent possible, free of errors and complete in view of the intended purpose.
- Data governance practices must address potential biases that could lead to discriminatory outcomes.
Poor data leads to biased AI, and biased AI leads to legal liability. Under the EU AI Act, data governance is not just a best practice; it is a legal obligation.
-
Dataset Review Request
An email arrives from Marcus Rodriguez, the AI Team Lead. The team is preparing to train ChurnPredict v3, and the dataset needs a compliance review before training can begin. The email links directly to the dataset on the DataOps platform.
-
Issue 1: Regional Underrepresentation
The DataOps platform loads the ChurnPredict v3 dataset review. The regional distribution of the training data immediately stands out: the dataset is heavily concentrated in one region, even though the model is designed to serve all four regions equally.
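A representativeness check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the DataOps platform's actual logic: the function names, the `tolerance` threshold, the region names other than North, and the sample records are all assumptions made for the example. The 72% North share mirrors the figure reported in the review summary.

```python
from collections import Counter

def region_shares(regions):
    """Return each region's share of the records as a fraction of the total."""
    counts = Counter(regions)
    total = sum(counts.values())
    return {region: count / total for region, count in counts.items()}

def underrepresented(shares, expected_regions, tolerance=0.5):
    """Flag regions whose share is below tolerance * the equal-split share.

    With four regions the equal-split share is 0.25, so the default
    tolerance of 0.5 (an assumed threshold) flags anything under 12.5%.
    """
    target = 1 / len(expected_regions)
    return sorted(
        region for region in expected_regions
        if shares.get(region, 0.0) < tolerance * target
    )

# Illustrative sample only: 72 North records, 28 spread over three
# hypothetical regions. Real data would come from the dataset itself.
sample = ["North"] * 72 + ["South"] * 10 + ["East"] * 9 + ["West"] * 9
shares = region_shares(sample)
flagged = underrepresented(shares, ["North", "South", "East", "West"])
print(shares)
print(flagged)
```

The threshold is a policy choice rather than something Article 10 prescribes, so it should be agreed with the compliance team before it is used as a gate.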
-
Issue 2: Stale Pre-Pandemic Data
The data collection timeline reveals another concern. A significant portion of the records predate a fundamental shift in customer behavior.
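One way to quantify stale records is to measure the fraction collected before a cutoff date. The sketch below assumes each record carries a collection date; the function name, the cutoff of 1 January 2021, and the sample dates are assumptions chosen to match the 2019-2020 window and the 38% figure mentioned in the review summary.

```python
from datetime import date

def stale_fraction(record_dates, cutoff):
    """Return the fraction of records collected strictly before the cutoff."""
    if not record_dates:
        return 0.0
    stale = sum(1 for collected in record_dates if collected < cutoff)
    return stale / len(record_dates)

# Illustrative sample only: 38 older records and 62 recent ones.
dates = [date(2019, 6, 1)] * 38 + [date(2022, 6, 1)] * 62
fraction = stale_fraction(dates, cutoff=date(2021, 1, 1))
print(f"{fraction:.0%} of records predate the cutoff")
```

Whether such records are dropped or down-weighted is a modeling decision; the check only makes the size of the problem visible.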
-
Issue 3: Data Leakage
A closer look at the feature list reveals a critical data quality problem that would undermine the model entirely.
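A simple screening heuristic for this kind of leakage is to look for low-cardinality features whose every value maps to exactly one target label. The sketch below is an assumption-laden illustration: `account_status` is the feature named in the review summary, while the `churned` target, the `tenure_months` feature, the status values, and the `max_distinct` limit are hypothetical. The heuristic catches only perfect encodings of the target, so it complements rather than replaces a manual feature review.

```python
from collections import defaultdict

def leakage_suspects(rows, target, max_distinct=10):
    """Return features where each distinct value maps to a single target label.

    The max_distinct limit (an assumed setting) skips high-cardinality
    columns such as identifiers, which are trivially deterministic.
    """
    suspects = []
    features = [name for name in rows[0] if name != target]
    for feature in features:
        labels_by_value = defaultdict(set)
        for row in rows:
            labels_by_value[row[feature]].add(row[target])
        deterministic = all(len(labels) == 1 for labels in labels_by_value.values())
        if deterministic and len(labels_by_value) <= max_distinct:
            suspects.append(feature)
    return suspects

# Illustrative rows only; column names and values are hypothetical.
rows = [
    {"tenure_months": 3, "account_status": "closed", "churned": 1},
    {"tenure_months": 3, "account_status": "active", "churned": 0},
    {"tenure_months": 24, "account_status": "active", "churned": 0},
    {"tenure_months": 24, "account_status": "closed", "churned": 1},
]
suspects = leakage_suspects(rows, target="churned")
print(suspects)
```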
-
Knowledge Check: Data Representativeness
Before continuing the review, a question about the regional distribution issue.
-
Issue 4: Unnecessary Personal Data
The final section of the review reveals a compliance risk that extends beyond the AI Act into GDPR territory.
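Data minimization for a training set can be sketched as dropping direct identifiers and, where a stable join key is still needed, replacing one identifier with a keyed hash. The example below uses Python's standard `hmac` and `hashlib` modules; the column names, the `customer_ref` field, and the demo key are assumptions for illustration. Note that pseudonymized data still counts as personal data under GDPR, so this reduces exposure rather than removing the obligation.

```python
import hashlib
import hmac

# Hypothetical column names for the direct identifiers.
PII_FIELDS = {"name", "email", "phone", "address"}

def pseudonymize(value, key):
    """Return a keyed SHA-256 digest of the value as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def minimize(record, key, id_field="email"):
    """Drop direct identifiers and add a pseudonymous reference instead."""
    cleaned = {name: value for name, value in record.items() if name not in PII_FIELDS}
    cleaned["customer_ref"] = pseudonymize(record[id_field], key)
    return cleaned

# Illustrative record only. The key is a placeholder; a real key would
# be stored separately from the dataset.
record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "phone": "555-0100",
    "address": "1 Main St",
    "tenure_months": 12,
}
cleaned = minimize(record, key=b"demo-key")
print(sorted(cleaned))
```

If the model never needs to link records back to a customer, the simpler option is to remove the identifiers without adding a reference at all.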
-
Review Summary
Alice has completed the data governance review. Four critical issues must be resolved before model training can proceed:
- Severe regional underrepresentation: 72% North region data for a model serving 4 regions equally. The dataset must be rebalanced to adequately represent all deployment regions.
- Stale pre-pandemic data: 38% of records from 2019-2020 no longer reflect current customer behavior. These records should be excluded or weighted appropriately.
- Data leakage: the account_status feature directly encodes the target variable and must be removed to prevent artificially inflated training accuracy.
- Unnecessary PII: raw names, emails, phone numbers, and addresses create GDPR exposure without contributing to churn prediction. These fields must be removed or pseudonymized.
-
File a Compliance Report
Identifying gaps is only half the job. To demonstrate compliance with Article 10, the data governance review must be documented and routed to the AI Team Lead and the Data Protection Officer, and model training stays paused until the issues are resolved.
-
Submit the Compliance Report
Alice fills in the report with the findings, the four gaps mapped to Article 10 and GDPR, and the actions the AI team must complete before training resumes.
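The shape of such a report can be sketched as a small data structure that maps each finding to its legal basis and required action. This is a hypothetical layout, not the format used by the course's reporting tool: the class names, field names, and rendering are assumptions, while the findings themselves are taken from the review summary.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Finding:
    issue: str
    legal_basis: str
    required_action: str

@dataclass
class ComplianceReport:
    dataset: str
    recipients: List[str]
    findings: List[Finding] = field(default_factory=list)

    def training_blocked(self) -> bool:
        """Training stays paused while any finding is listed."""
        return bool(self.findings)

    def render(self) -> str:
        """Return the report as plain text, one numbered line per finding."""
        lines = [f"Dataset: {self.dataset}", "To: " + ", ".join(self.recipients)]
        for number, finding in enumerate(self.findings, start=1):
            lines.append(
                f"{number}. {finding.issue} [{finding.legal_basis}] -> {finding.required_action}"
            )
        lines.append("Training blocked: " + ("yes" if self.training_blocked() else "no"))
        return "\n".join(lines)

report = ComplianceReport(
    dataset="ChurnPredict v3",
    recipients=["AI Team Lead", "Data Protection Officer"],
    findings=[
        Finding("Regional underrepresentation", "EU AI Act Article 10", "Rebalance across all regions"),
        Finding("Stale pre-pandemic data", "EU AI Act Article 10", "Exclude or down-weight 2019-2020 records"),
        Finding("Data leakage via account_status", "EU AI Act Article 10", "Remove the feature"),
        Finding("Unnecessary PII", "GDPR data minimization", "Remove or pseudonymize identifiers"),
    ],
)
text = report.render()
print(text)
```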