Minimum Required Metadata for Trust
1 Purpose
EA 2.0 doesn’t just connect data — it tells the story of that data.
Lineage and provenance give every node a narrative: how it was created, transformed, and used.
This allows any insight or automation to be explained, audited, and reproduced — the foundation of Responsible AI.
2 Key Principles
| Principle | Meaning |
|---|---|
| End-to-End Traceability | Every fact can be traced from its source system to its business outcome. |
| Immutable Evidence | Lineage is append-only — never overwritten. |
| Human and Machine Readable | Graph relationships describe lineage both semantically and visually. |
| Granular by Design | At minimum, object-level lineage; optionally, field-level for regulated data. |
| Cross-Domain Continuity | Connects data movement across apps, cloud services, and processes. |
3 Minimum Provenance Metadata (per Node or Edge)
| Field | Description | Example |
|---|---|---|
| `source_system` | System of origin | ServiceNow, Azure Monitor |
| `source_table` | Object or API endpoint | cmdb_ci_app |
| `extracted_at` | Ingestion timestamp | 2025-11-08T09:30Z |
| `transform_rule` | Applied ETL logic or policy | mapAppID(), normalizeTags() |
| `owner` | Steward responsible | App Owner |
| `verified_by` | Last human validator | DQ Steward |
| `lineage_path` | Upstream chain hash | cap-123 → app-A → data-Z |
| `confidence` | 0–1 trust score | 0.94 |
These attributes live on nodes and edges, forming a self-documenting web of provenance.
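As a rough illustration, the fields in the table could be modelled as a small immutable record before being written as node or edge properties. This is a minimal sketch, not a prescribed schema: the `Provenance` class name, the choice of Python types, and the optional fields are assumptions; only the field names and example values come from the table.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)  # frozen: a record is never mutated after creation
class Provenance:
    """Hypothetical container for the minimum provenance fields."""
    source_system: str              # system of origin, e.g. "ServiceNow"
    source_table: str               # object or API endpoint
    extracted_at: str               # ingestion timestamp (ISO 8601, UTC)
    transform_rule: Optional[str]   # applied ETL logic; None for raw data (assumption)
    owner: str                      # steward responsible
    verified_by: Optional[str]      # last human validator, if any (assumption)
    lineage_path: str               # upstream chain
    confidence: float               # trust score between 0 and 1

    def __post_init__(self) -> None:
        # Reject scores outside the 0-1 range given in the table.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


record = Provenance(
    source_system="ServiceNow",
    source_table="cmdb_ci_app",
    extracted_at="2025-11-08T09:30Z",
    transform_rule="mapAppID()",
    owner="App Owner",
    verified_by="DQ Steward",
    lineage_path="cap-123 → app-A → data-Z",
    confidence=0.94,
)

# Plain dict, ready to be attached to a node or edge as properties.
node_properties = asdict(record)
```

Making the record frozen mirrors the append-only principle: a correction becomes a new record rather than an in-place update.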
4 Core Lineage Relationships
(:DataEntity)-[:DERIVED_FROM]->(:SourceData)
(:DataEntity)-[:TRANSFORMED_BY]->(:Process)
(:Process)-[:EXECUTED_ON]->(:System)
(:System)-[:OWNED_BY]->(:Person)
(:DataEntity)-[:FEEDS]->(:Application)
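Queries like
“Show every process that transformed financial data before it reached the KPI dashboard.”
become short multi-hop traversals: follow FEEDS back from the dashboard, walk DERIVED_FROM upstream, and collect each TRANSFORMED_BY process along the way.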
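To make the traversal concrete, the sketch below answers the dashboard question over a tiny in-memory edge list using the relationship types listed above. The node names and the `processes_feeding` helper are hypothetical, and a real deployment would run the equivalent traversal in the graph database rather than in Python.

```python
from collections import deque

# Hypothetical sample edges as (start, relationship, end) triples.
EDGES = [
    ("kpi_revenue", "FEEDS", "KPI Dashboard"),
    ("kpi_revenue", "DERIVED_FROM", "ledger_clean"),
    ("kpi_revenue", "TRANSFORMED_BY", "aggregate_monthly"),
    ("ledger_clean", "DERIVED_FROM", "ledger_raw"),
    ("ledger_clean", "TRANSFORMED_BY", "normalize_currency"),
    ("hr_headcount", "TRANSFORMED_BY", "dedupe_employees"),
]


def targets(node, relationship):
    """Return the end nodes of all edges of one type leaving `node`."""
    return [end for start, rel, end in EDGES
            if start == node and rel == relationship]


def processes_feeding(application):
    """List every process that transformed data upstream of `application`."""
    # Start from the data entities that FEED the application.
    queue = deque(start for start, rel, end in EDGES
                  if rel == "FEEDS" and end == application)
    seen, processes = set(), []
    while queue:
        entity = queue.popleft()
        if entity in seen:
            continue
        seen.add(entity)
        # Collect the processes that transformed this entity.
        for process in targets(entity, "TRANSFORMED_BY"):
            if process not in processes:
                processes.append(process)
        # Continue upstream along DERIVED_FROM.
        queue.extend(targets(entity, "DERIVED_FROM"))
    return processes


print(processes_feeding("KPI Dashboard"))
# ['aggregate_monthly', 'normalize_currency']
```

The unrelated `dedupe_employees` process is excluded because nothing downstream of it feeds the dashboard.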
5 Capture Mechanisms
| Stage | Mechanism | Tooling |
|---|---|---|
| Ingestion | Extract metadata headers | Azure Data Factory / ADF Mapping Data Flows |
| Transformation | Auto-generate lineage JSON | Functions / ETL scripts |
| Load (Graph) | Write DERIVED_FROM edges | Neo4j Cypher MERGE (upsert) |
| Application Usage | Intercept API calls / BI queries | Logic Apps / Power BI Usage API |
EA 2.0 automatically builds lineage as data flows through these stages.
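One way the transformation and load stages might fit together is sketched below: a transformation step emits a small lineage event, and the same event supplies the parameters for a Cypher `MERGE` that writes the DERIVED_FROM edge. The `lineage_event` helper, the `id` property, and the parameter names are assumptions for illustration; how the statement is executed depends on the driver in use and is left out.

```python
import json
from datetime import datetime, timezone

# Parameterised Cypher for the load stage. ON CREATE SET writes properties
# only when the edge is first created, so existing evidence is not overwritten.
MERGE_DERIVED_FROM = """
MERGE (s:SourceData {id: $source_id})
MERGE (d:DataEntity {id: $entity_id})
MERGE (d)-[r:DERIVED_FROM]->(s)
ON CREATE SET r.source_system = $source_system,
              r.transform_rule = $transform_rule,
              r.extracted_at = $extracted_at
"""


def lineage_event(entity_id, source_id, source_system, transform_rule,
                  extracted_at=None):
    """Build a JSON-serialisable lineage event (hypothetical shape)."""
    when = extracted_at or datetime.now(timezone.utc)
    return {
        "entity_id": entity_id,
        "source_id": source_id,
        "source_system": source_system,
        "transform_rule": transform_rule,
        "extracted_at": when.isoformat(),
    }


event = lineage_event(
    entity_id="data-Z",
    source_id="app-A",
    source_system="ServiceNow",
    transform_rule="mapAppID()",
    extracted_at=datetime(2025, 11, 8, 9, 30, tzinfo=timezone.utc),
)

# Lineage JSON emitted by the transformation stage.
lineage_json = json.dumps(event, indent=2)
# The same dict can be passed as query parameters alongside MERGE_DERIVED_FROM.
```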
6 Provenance Validation Rules
- Completeness Rule: Every node must have a `source_system`.
- Timestamp Rule: `extracted_at` ≤ 7 days old.
- Transform Disclosure Rule: All derived data must record `transform_rule`.
- Ownership Rule: Each object requires an `owner`.
- Verification Rule: If `confidence` < 0.8, trigger manual validation task.
Violations automatically raise DQ or GRC tickets.
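The five rules can be expressed as one check per node. The following is a minimal sketch, assuming each node is a plain dict, that `extracted_at` is a timezone-aware `datetime`, and that a hypothetical `derived` flag marks derived data; treating a missing timestamp or missing confidence as a violation is also an assumption, and ticket creation is out of scope here.

```python
from datetime import datetime, timedelta, timezone
from typing import List

MAX_AGE = timedelta(days=7)   # Timestamp Rule threshold
MIN_CONFIDENCE = 0.8          # Verification Rule threshold


def validate(node: dict, now: datetime) -> List[str]:
    """Return the names of the rules this node violates."""
    violations = []
    if not node.get("source_system"):
        violations.append("completeness")
    extracted_at = node.get("extracted_at")
    # Assumption: a missing timestamp counts as stale.
    if extracted_at is None or now - extracted_at > MAX_AGE:
        violations.append("timestamp")
    # `derived` is a hypothetical flag marking derived data.
    if node.get("derived") and not node.get("transform_rule"):
        violations.append("transform_disclosure")
    if not node.get("owner"):
        violations.append("ownership")
    confidence = node.get("confidence")
    # Assumption: a missing score also needs manual validation.
    if confidence is None or confidence < MIN_CONFIDENCE:
        violations.append("verification")
    return violations


now = datetime(2025, 11, 10, tzinfo=timezone.utc)
stale_node = {
    "source_system": "ServiceNow",
    "extracted_at": datetime(2025, 10, 1, tzinfo=timezone.utc),
    "derived": True,
    "transform_rule": None,
    "owner": "App Owner",
    "confidence": 0.6,
}
print(validate(stale_node, now))
# ['timestamp', 'transform_disclosure', 'verification']
```

Returning rule names rather than raising on the first failure lets one pass produce a single ticket that lists every violation.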
7 Lineage Visualization
- Horizontal Flow: Source → Transformation → Storage → Consumption
- Vertical Flow: Strategic Capability → Application → Data → Outcome
- Color Coding: green = verified, yellow = manual step, red = missing link
- Graph View: hover node → shows provenance attributes.
These views help architects answer: “What changed between version 2 and 3 of this data?”
8 Lineage Quality Metrics
| Metric | Definition | Target |
|---|---|---|
| Lineage Completeness % | Nodes with valid source_system | ≥ 95 % |
| Transformation Transparency % | Processes with logged transform_rule | ≥ 90 % |
| Verification Coverage % | Nodes with verified_by populated | ≥ 80 % |
| Average Confidence Score | Mean trust value | ≥ 0.9 |
Low scores automatically surface in DQ dashboards and drive stewardship actions.
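The metrics in the table reduce to simple ratios and a mean. The sketch below computes them over lists of dicts and flags any that fall below target; the function names, the metric keys, and the reading of "valid" as "non-empty" are assumptions made for illustration.

```python
# Targets taken from the table above.
TARGETS = {
    "lineage_completeness_pct": 95.0,
    "transformation_transparency_pct": 90.0,
    "verification_coverage_pct": 80.0,
    "average_confidence": 0.9,
}


def pct_populated(items, field):
    """Percentage of items whose `field` is present and non-empty."""
    if not items:
        return 0.0
    return 100.0 * sum(1 for item in items if item.get(field)) / len(items)


def lineage_metrics(nodes, processes):
    """Compute the four lineage quality metrics."""
    scores = [n["confidence"] for n in nodes if n.get("confidence") is not None]
    return {
        "lineage_completeness_pct": pct_populated(nodes, "source_system"),
        "transformation_transparency_pct": pct_populated(processes, "transform_rule"),
        "verification_coverage_pct": pct_populated(nodes, "verified_by"),
        "average_confidence": sum(scores) / len(scores) if scores else 0.0,
    }


def below_target(metrics):
    """Names of metrics that miss their target, sorted alphabetically."""
    return sorted(name for name, value in metrics.items()
                  if value < TARGETS[name])


nodes = [
    {"source_system": "ServiceNow", "verified_by": "DQ Steward", "confidence": 1.0},
    {"source_system": "Azure Monitor", "verified_by": None, "confidence": 0.5},
    {"source_system": "ServiceNow", "verified_by": "DQ Steward", "confidence": 1.0},
    {"source_system": None, "verified_by": "DQ Steward", "confidence": 0.5},
]
processes = [
    {"transform_rule": "mapAppID()"},
    {"transform_rule": "normalizeTags()"},
]

metrics = lineage_metrics(nodes, processes)
print(below_target(metrics))
# ['average_confidence', 'lineage_completeness_pct', 'verification_coverage_pct']
```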
9 Governance Integration
- ServiceNow GRC tasks auto-generated for lineage breaches.
- Stewards verify via Teams forms linked to the graph.
- Once verified, `confidence` updated and policy closed.
- Quarterly audit compares lineage depth vs schema growth.
10 Security and Privacy
- Hash PII before storing lineage references.
- Restrict edge visibility by role in Neo4j RBAC.
- Encrypt `lineage_path` values in transit and at rest.
- Use Microsoft Purview (formerly Azure Purview) or Microsoft Fabric as federated catalogs for lineage.
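For the first item, one possible approach is a keyed hash, so that lineage references stay joinable without exposing the raw value. Note that the sketch swaps in HMAC-SHA256 in place of a plain hash, on the assumption that low-entropy PII such as email addresses would otherwise be easy to guess; the `pseudonymize` helper and the inline key are hypothetical, and in practice the key would come from a secret store.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Return a stable keyed hash of a PII value for use in lineage references."""
    # Normalise so that trivially different spellings map to the same token.
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for illustration only; never hard-code a real key.
key = b"example-key-for-illustration"

token = pseudonymize("Alice@Example.com ", key)
same_token = pseudonymize("alice@example.com", key)
# token == same_token, and neither contains the original address.
```

Because the output is deterministic for a given key, two lineage edges that reference the same person still resolve to the same token.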
11 Benefits
✅ Transparency builds trust in AI recommendations.
✅ Auditors can verify every KPI back to its source table.
✅ Architects see dependencies that predict impact.
✅ Data Stewards gain ownership clarity.
12 Common Challenges & Mitigations
| Challenge | Impact | Mitigation |
|---|---|---|
| Legacy systems without metadata exports | Lineage gaps | Use proxy capture via ETL or API logs |
| Rapid schema changes | Broken links | Schema versioning + drift alerts |
| Manual data uploads | Untracked sources | Governed drop folders + mandatory form metadata |
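13 Example Query
MATCH p = (d:DataEntity)-[:DERIVED_FROM*1..4]->(s:SourceData)
WHERE d.name CONTAINS 'Customer'
RETURN s.source_system AS source_system, min(length(p)) AS min_hops, count(p) AS paths;
→ Shows every source system that contributes to Customer data within 4 hops, with the shortest hop distance and the number of lineage paths for each.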
14 Takeaway
Lineage is the truth engine of EA 2.0.
Without provenance, automation is just a guess. With it, AI and humans can trust each other’s work because every insight comes with a receipt.