EU AI Act Article 10: Data and Data Governance for High-Risk AI Systems
EU AI Act Article 10 requires providers of high-risk AI systems to satisfy specific data governance obligations before placing their systems on the EU market. The requirements include: training data representativeness relative to the deployment population; bias examination before deployment, documenting both pre-mitigation findings and residual bias; documentation of labeling methodology; and data governance practices covering collection, preprocessing, versioning, and access controls. Where a system incorporates a foundation model, the foundation model must be documented as a data component of the system. Behavioral baselines established at deployment satisfy the post-deployment monitoring arm of Article 10(4).
Article 10(2): Training Data Requirements
Article 10(2) requires training, validation, and test datasets to be relevant and representative relative to the intended purpose. Representativeness requires demographic completeness: the data distribution must reflect the actual population and situations the system will encounter in deployment. Article 10(2)(c) explicitly requires appropriate statistical properties, including proportionate representation of persons or groups. This is the legislative basis for requiring demographic disaggregation in data documentation: not just performance metrics, but the composition of the data itself. Providers must also document labeling methodology (labeler instructions, quality controls, inter-rater reliability scores) and collection methodology.
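A minimal sketch of how demographic disaggregation and labeling quality control can be evidenced in practice, in Python. The record layout, the group field, the reference population shares, and the 5% tolerance are illustrative assumptions, not values prescribed by Article 10(2):

```python
# Disaggregate a dataset's composition against a reference deployment
# population, and compute inter-rater reliability for labeling QC.
# Field names and the tolerance are illustrative assumptions.
from collections import Counter

def composition_report(records, group_field, reference_shares, tolerance=0.05):
    """Compare each group's share of the dataset to its expected share
    in the deployment population; flag deviations beyond `tolerance`."""
    counts = Counter(r[group_field] for r in records)
    total = sum(counts.values())
    report = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        report[group] = {
            "observed_share": round(observed, 4),
            "expected_share": expected,
            "flag": abs(observed - expected) > tolerance,
        }
    return report

def cohen_kappa(labels_a, labels_b):
    """Inter-rater agreement between two labelers (Cohen's kappa)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

Reporting the flagged groups alongside the reference shares yields exactly the disaggregated composition evidence described above.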
Article 10(3): Bias Examination Before Market Placement
Article 10(3) requires providers to examine datasets for biases that are likely to affect health and safety or to lead to discrimination prohibited under Union law before placing the system on the EU market. The examination must cover the complete data pipeline: collection methodology, labeling procedures, preprocessing transformations, and final dataset composition. Bias examination documentation must record which bias detection methods were applied, what results they produced, what mitigation measures were taken, and the residual bias present after mitigation. Post-mitigation results alone are insufficient; the process itself must be evidenced. Standard quantitative methods include demographic parity analysis (selection rates by protected attribute), equalized odds testing (error rates by protected attribute), and counterfactual fairness testing where applicable.
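The two quantitative checks named above are straightforward to sketch. The record layout (binary y_true/y_pred labels plus a group attribute) and the 0.8 parity-ratio threshold, borrowed from the US four-fifths rule, are illustrative assumptions:

```python
# Demographic parity (selection rate by group) and equalized odds
# (false positive / false negative rates by group) for binary decisions.
from collections import defaultdict

def rates_by_group(records):
    """records: dicts with 'group', 'y_true', 'y_pred' (0/1 labels)."""
    stats = defaultdict(lambda: {"n": 0, "selected": 0,
                                 "fp": 0, "fn": 0, "pos": 0, "neg": 0})
    for r in records:
        s = stats[r["group"]]
        s["n"] += 1
        s["selected"] += r["y_pred"]
        if r["y_true"] == 1:
            s["pos"] += 1
            s["fn"] += (r["y_pred"] == 0)
        else:
            s["neg"] += 1
            s["fp"] += (r["y_pred"] == 1)
    return {
        g: {
            "selection_rate": s["selected"] / s["n"],         # demographic parity
            "fpr": s["fp"] / s["neg"] if s["neg"] else None,  # equalized odds
            "fnr": s["fn"] / s["pos"] if s["pos"] else None,
        }
        for g, s in stats.items()
    }

def parity_ratio(rates):
    """Min/max selection-rate ratio; below ~0.8 is a common red flag."""
    sel = [v["selection_rate"] for v in rates.values()]
    return min(sel) / max(sel)
```

Running these checks before and after mitigation, and archiving both outputs, produces the pre-mitigation and residual-bias evidence the examination record must contain.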
Article 10(4): Data Governance Practices
Article 10(4) requires written data governance practices covering the entire data pipeline: collection, preprocessing, versioning, access controls, and quality control. Dataset versioning is required: every dataset used for training, validation, or testing must be identifiable by version. Access audit trails must record who accessed datasets, when, and what operations were performed. Provenance documentation must trace each data source to its origin and the legal basis for its collection. For systems built on third-party datasets or foundation model components, Article 10(4) requires documentation of the third party's data governance practices, not merely a reference to their terms of service.
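One way to implement the record-keeping side, sketched in Python: content-hashed dataset versions carrying provenance and legal-basis fields, plus an append-only access log. File names and field choices are illustrative, not mandated by the Act:

```python
# Content-hashed dataset registry and append-only access audit trail.
# Registry/log file names and field choices are illustrative assumptions.
import datetime
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass
class DatasetVersion:
    name: str
    version: str
    sha256: str        # content hash pins the exact bytes used
    source: str        # origin of the data
    legal_basis: str   # e.g. consent, contract, legitimate interest
    collected: str     # collection date or range

def register_dataset(path, name, version, source, legal_basis, collected):
    """Record an identifiable, content-addressed dataset version."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    record = DatasetVersion(name, version, digest, source, legal_basis, collected)
    with open("dataset_registry.jsonl", "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return record

def log_access(user, dataset, version, operation):
    """Append who touched which dataset version, when, and how."""
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "version": version,
        "operation": operation,  # e.g. read / write / export
    }
    with open("access_audit.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
```

The content hash ties every training run to the exact bytes it consumed, which is what makes "identifiable by version" auditable rather than a naming convention.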
Article 10 for Foundation Model-Based Systems
Most enterprise AI systems use foundation models (GPT-4, Claude, Gemini, Llama) as components. Article 10 obligations apply to the complete system, including its foundation model components. Because providers cannot fully document training data they did not collect, the framework requires four measures: documenting what the foundation model provider discloses (model cards, data governance statements); conducting black-box bias testing on model outputs even when the training data is inaccessible; fully documenting any fine-tuning data; and establishing behavioral baselines at deployment to detect behavioral drift after model provider updates. A foundation model version update that changes output behavior triggers the Article 10(3) re-examination obligation.
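A minimal sketch of such a behavioral baseline, assuming a fixed probe-prompt set and caller-supplied query_model and classify_output functions (both hypothetical placeholders for the actual model integration). The population stability index and the ~0.2 drift threshold are common monitoring conventions, not Article 10 requirements:

```python
# Behavioral baseline for a black-box foundation model component:
# record the category distribution of responses to a fixed probe set
# at deployment, then re-run after provider updates and compare.
from collections import Counter
from math import log

def output_distribution(prompts, query_model, classify_output):
    """Map each probe response to a coarse category (e.g. refusal /
    answer / off-topic) and return per-category shares."""
    counts = Counter(classify_output(query_model(p)) for p in prompts)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def population_stability_index(baseline, current, eps=1e-6):
    """PSI between baseline and current shares; values above ~0.2
    are commonly read as significant drift."""
    cats = set(baseline) | set(current)
    return sum(
        (current.get(c, eps) - baseline.get(c, eps))
        * log(current.get(c, eps) / baseline.get(c, eps))
        for c in cats
    )
```

A PSI crossing the drift threshold after a provider-side version change is the signal that the Article 10(3) re-examination described above should be repeated.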