EU AI Act Annex IV: Technical Documentation Requirements for High-Risk AI Systems
Article 11 of the EU AI Act requires providers of high-risk AI systems to maintain Annex IV technical documentation before placing the system on the EU market. Annex IV specifies eight categories of required content. This guide explains each section and the evidence that auditors, notified bodies, and market surveillance authorities actually check. Annex IV documentation must remain current through all system updates: a substantial modification requires a documentation update and potentially a new conformity assessment.
Annex IV § 1 and § 2: System Description and Development
Section 1 requires a general description: the intended purpose and deployment context (be specific: "score loan applications for credit risk", not "assist with lending"), the software version with version history, hardware specifications, and all external systems and APIs the AI system interacts with. The most common § 1 gap is third-party model documentation: when a high-risk AI system uses an LLM API, the provider must document the foundation model as a component, including its version, provider, and capabilities.

Section 2 requires a description of the AI system architecture (components, reasoning approach, decision logic), the training methodology, dataset specifications with demographic distribution, the labeling methodology, and documentation of all pre-trained or third-party models used. For GPAI models, document whether the model provider has fulfilled its Article 53 obligations and what technical documentation it supplied.
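Annex IV prescribes content, not format, but keeping § 1 and § 2 material in a machine-readable structure makes version tracking and audits easier. A minimal sketch; every field name here is illustrative, not mandated by the Act:

```python
from dataclasses import dataclass, field

@dataclass
class ThirdPartyModel:
    """One pre-trained or foundation-model component of the system (field names illustrative)."""
    provider: str
    name: str
    version: str
    capabilities: str
    article_53_docs_received: bool  # did the GPAI provider supply its Article 53 documentation?

@dataclass
class SystemDescription:
    """Skeleton for an Annex IV section 1 / section 2 record (field names illustrative)."""
    intended_purpose: str  # be specific, e.g. "score loan applications for credit risk"
    software_version: str
    version_history: list = field(default_factory=list)
    hardware_specs: str = ""
    external_systems: list = field(default_factory=list)    # APIs and integrations
    third_party_models: list = field(default_factory=list)  # ThirdPartyModel entries
```

A structure like this can be rendered to the human-readable documentation file while staying diffable across system versions.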
Annex IV § 3: Monitoring, Human Oversight, and Logging
Section 3 documents how the AI system is monitored after deployment and how Article 14 human oversight is operationalized. Required: a description of oversight interfaces and tools, documentation that designated persons have authority to stop or override the AI system, escalation procedures, training requirements for oversight staff, and instructions to deployers. The logging sub-section requires: a description of what events are logged, the log format and what each field contains, the retention period (a minimum of six months under Article 19, though sector-specific requirements are often longer), tamper-evidence controls ensuring logs cannot be modified retroactively, and the process for making logs available to market surveillance authorities. The most common § 3 gap: application-level error logs only, with no per-decision records linking inputs, reasoning, and outcomes for individual affected persons.
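One way to satisfy both the per-decision-record and tamper-evidence requirements at once is a hash-chained decision log: each entry embeds the hash of the previous entry, so any retroactive modification breaks verification. A minimal sketch; the record fields shown are illustrative, not an Annex IV schema:

```python
import hashlib
import json

def append_decision(log, record):
    """Append one per-decision record, chained to the previous entry's hash.

    Each entry links inputs, reasoning, and outcome for one affected person.
    Embedding the previous entry's hash makes retroactive edits detectable.
    """
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64
    entry = {"record": record, "prev_hash": prev_hash}
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["entry_hash"] = hashlib.sha256(payload).hexdigest()
    log.append(entry)
    return entry

def verify_chain(log):
    """Recompute every hash in order; return False if any entry was altered."""
    prev_hash = "0" * 64
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        payload = json.dumps(
            {"record": entry["record"], "prev_hash": entry["prev_hash"]},
            sort_keys=True,
        ).encode()
        if hashlib.sha256(payload).hexdigest() != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True
```

In production this chain would be anchored externally (e.g. periodic hash publication to write-once storage), since an attacker who can rewrite the whole log can rebuild the chain.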
Annex IV § 4 and § 5: Performance Metrics and Testing
Section 4 requires justification for the chosen performance metrics given the system's specific task and risk profile — not just "95% accuracy" but why 95% is acceptable given who the errors affect and what happens to them. Demographic disaggregation is required: precision, recall, and error rates broken down by sex, race/ethnicity, and other protected attributes relevant to the use case. Section 5 requires test dataset specification, validation methodology, bias testing results (disparate impact analysis with selection rates by protected category), adversarial robustness testing, and complete test logs with individual results tied to specific system versions. The most common § 5 gap: no demographic disaggregation — overall accuracy metrics without any analysis of how performance varies across protected groups.
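The disaggregation and disparate-impact analyses above are mechanical once per-decision records carry the protected attribute. A minimal sketch; the record keys, the choice of reference group, and the four-fifths threshold are illustrative conventions, not requirements set by the Act:

```python
from collections import defaultdict

def disaggregated_metrics(records):
    """Compute selection rate, precision, and recall per protected group.

    `records` is a list of dicts with keys `group` (protected attribute
    value), `predicted` (1 = selected/approved), `actual` (1 = true label).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "n": 0, "selected": 0})
    for r in records:
        c = counts[r["group"]]
        c["n"] += 1
        c["selected"] += r["predicted"]
        if r["predicted"] and r["actual"]:
            c["tp"] += 1
        elif r["predicted"] and not r["actual"]:
            c["fp"] += 1
        elif not r["predicted"] and r["actual"]:
            c["fn"] += 1
    return {
        group: {
            "selection_rate": c["selected"] / c["n"],
            "precision": c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else None,
            "recall": c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else None,
        }
        for group, c in counts.items()
    }

def disparate_impact_ratios(metrics, reference_group):
    """Ratio of each group's selection rate to the reference group's.

    A ratio below 0.8 (the 'four-fifths rule') is a common adverse-impact
    flag, though the AI Act does not mandate this specific threshold.
    """
    ref = metrics[reference_group]["selection_rate"]
    return {g: m["selection_rate"] / ref for g, m in metrics.items()}
```

The resulting per-group table, tied to a specific system version, is exactly the artifact the § 4 and § 5 evidence review looks for.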
Annex IV § 6–8: Standards, Conformity, and Post-Market Monitoring
Section 6 lists harmonized EU standards applied (once published) or other frameworks such as ISO 42001 or NIST AI RMF. Section 7 is the EU Declaration of Conformity under Article 47 — a signed provider declaration affirming compliance, identifying the Annex III category, referencing the technical documentation, and identifying the notified body if third-party assessment was required. Section 8 is the post-market monitoring plan under Article 72: KPIs tracked after deployment, thresholds that trigger investigation, incident reporting triggers for Article 73 serious incident notifications, and the process for updating Annex IV when behavioral monitoring identifies material changes. The most common § 8 gap: no defined behavioral baselines — without documented expected behavior at deployment, drift cannot be detected and the post-market monitoring requirement cannot be satisfied.
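A documented baseline only satisfies § 8 if something actually compares live KPIs against it. A minimal sketch of such a check, where the KPI names, baseline values, and tolerances are all illustrative and would come from the provider's own monitoring plan:

```python
def check_drift(baseline, observed, tolerances):
    """Compare observed post-market KPIs to the documented deployment baseline.

    Returns the KPIs whose absolute deviation exceeds the documented
    tolerance; each breach is a trigger for investigation and, if the
    change is material, an Annex IV documentation update.
    """
    breaches = []
    for kpi, base_value in baseline.items():
        deviation = abs(observed[kpi] - base_value)
        if deviation > tolerances[kpi]:
            breaches.append({
                "kpi": kpi,
                "baseline": base_value,
                "observed": observed[kpi],
                "deviation": deviation,
            })
    return breaches
```

Running a check like this on a schedule, and recording its outputs, produces the evidence trail that the behavioral-monitoring portion of the § 8 review expects.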