Lineage Capture & Provenance Rules

Minimum Required Metadata for Trust


1 Purpose

EA 2.0 doesn’t just connect data — it tells the story of that data.
Lineage and provenance give every node a narrative: how it was created, transformed, and used.
This allows any insight or automation to be explained, audited, and reproduced — the foundation of Responsible AI.


2 Key Principles

| Principle | Meaning |
| --- | --- |
| End-to-End Traceability | Every fact can be traced from its source system to its business outcome. |
| Immutable Evidence | Lineage is append-only and never overwritten. |
| Human and Machine Readable | Graph relationships describe lineage both semantically and visually. |
| Granular by Design | At minimum, object-level lineage; optionally, field-level for regulated data. |
| Cross-Domain Continuity | Connects data movement across apps, cloud services, and processes. |

3 Minimum Provenance Metadata (per Node or Edge)

| Field | Description | Example |
| --- | --- | --- |
| source_system | System of origin | ServiceNow, Azure Monitor |
| source_table | Object or API endpoint | cmdb_ci_app |
| extracted_at | Ingestion timestamp | 2025-11-08T09:30Z |
| transform_rule | Applied ETL logic or policy | mapAppID(), normalizeTags() |
| owner | Steward responsible | App Owner |
| verified_by | Last human validator | DQ Steward |
| lineage_path | Upstream chain hash | cap-123 → app-A → data-Z |
| confidence | 0–1 trust score | 0.94 |

These attributes live on nodes and edges, forming a self-documenting web of provenance.
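As a minimal sketch, the fields in the table could be modeled as one record type before being written to the graph. The class name `Provenance`, the choice of optional fields, and the range check are illustrative assumptions, not part of EA 2.0 itself.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """Minimum provenance metadata carried by a node or edge (hypothetical model)."""
    source_system: str             # system of origin, e.g. "ServiceNow"
    source_table: str              # object or API endpoint
    extracted_at: str              # ingestion timestamp, ISO 8601 in UTC
    transform_rule: Optional[str]  # applied ETL logic; None for raw source data (assumption)
    owner: str                     # steward responsible
    verified_by: Optional[str]     # None until a human validates it (assumption)
    lineage_path: str              # upstream chain
    confidence: float              # trust score between 0 and 1

    def __post_init__(self):
        # Reject scores outside the 0-1 range stated in the table.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


record = Provenance(
    source_system="ServiceNow",
    source_table="cmdb_ci_app",
    extracted_at="2025-11-08T09:30Z",
    transform_rule="mapAppID()",
    owner="App Owner",
    verified_by="DQ Steward",
    lineage_path="cap-123 → app-A → data-Z",
    confidence=0.94,
)

# Flat dict that could be set as properties on a node or edge.
node_properties = asdict(record)
```

Freezing the dataclass is one way to echo the append-only principle in application code: a record is replaced by a new one rather than edited in place.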


4 Core Lineage Relationships

(:DataEntity)-[:DERIVED_FROM]->(:SourceData)
(:DataEntity)-[:TRANSFORMED_BY]->(:Process)
(:Process)-[:EXECUTED_ON]->(:System)
(:System)-[:OWNED_BY]->(:Person)
(:DataEntity)-[:FEEDS]->(:Application)

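Queries like

“Show every process that transformed financial data before it reached the KPI dashboard.”
become short multi-hop traversals: follow FEEDS back from the dashboard to its data entities, walk DERIVED_FROM upstream, and collect each TRANSFORMED_BY process along the way.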
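As a rough illustration of how such a question maps onto the relationships in this section, the sketch below walks an in-memory edge list in plain Python. The node names and the sample edges are hypothetical; in EA 2.0 the same walk would run as a graph query rather than in application code.

```python
from collections import defaultdict

# Hypothetical sample edges: (start node, relationship, end node)
EDGES = [
    ("kpi_revenue", "FEEDS", "KPI Dashboard"),
    ("kpi_revenue", "TRANSFORMED_BY", "aggregate_monthly"),
    ("kpi_revenue", "DERIVED_FROM", "ledger_clean"),
    ("ledger_clean", "TRANSFORMED_BY", "normalize_currency"),
    ("ledger_clean", "DERIVED_FROM", "ledger_raw"),
    ("hr_headcount", "TRANSFORMED_BY", "dedupe_staff"),
]


def build_index(edges):
    """Index outgoing edges as {(node, relationship): [targets]}."""
    index = defaultdict(list)
    for start, rel, end in edges:
        index[(start, rel)].append(end)
    return index


def processes_before(application, edges):
    """Return every process that transformed data upstream of `application`."""
    index = build_index(edges)
    # Start from the entities that feed the application directly.
    frontier = [s for s, rel, e in edges if rel == "FEEDS" and e == application]
    seen, processes = set(), set()
    while frontier:
        entity = frontier.pop()
        if entity in seen:
            continue
        seen.add(entity)
        processes.update(index[(entity, "TRANSFORMED_BY")])
        frontier.extend(index[(entity, "DERIVED_FROM")])  # walk upstream
    return sorted(processes)


print(processes_before("KPI Dashboard", EDGES))
# prints ['aggregate_monthly', 'normalize_currency']
```

The unrelated `dedupe_staff` process is excluded because no DERIVED_FROM or FEEDS path connects it to the dashboard.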


5 Capture Mechanisms

| Stage | Mechanism | Tooling |
| --- | --- | --- |
| Ingestion | Extract metadata headers | Azure Data Factory / ADF Mapping Data Flows |
| Transformation | Auto-generate lineage JSON | Functions / ETL scripts |
| Load (Graph) | Write DERIVED_FROM edges | Neo4j Cypher MERGE (upsert) |
| Application Usage | Intercept API calls / BI queries | Logic Apps / Power BI Usage API |

EA 2.0 automatically builds lineage as data flows through these stages.
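The Transformation and Load stages might look roughly like the sketch below: a transform step emits a small lineage record as JSON, and the same values become parameters of a Cypher MERGE. The function name `lineage_event`, the JSON keys, and the `id` property are assumptions made for illustration; the source does not specify the lineage JSON schema.

```python
import json
from datetime import datetime, timezone


def lineage_event(entity_id, source_id, source_system, transform_rule, extracted_at=None):
    """Build the lineage record for one transformation step (hypothetical schema)."""
    extracted_at = extracted_at or datetime.now(timezone.utc)
    return {
        "entity_id": entity_id,
        "source_id": source_id,
        "source_system": source_system,
        "transform_rule": transform_rule,
        "extracted_at": extracted_at.isoformat(),
    }


# MERGE creates the nodes and the edge only if they do not exist yet;
# ON CREATE SET leaves the properties of an existing edge untouched.
DERIVED_FROM_MERGE = """
MERGE (d:DataEntity {id: $entity_id})
MERGE (s:SourceData {id: $source_id})
MERGE (d)-[r:DERIVED_FROM]->(s)
ON CREATE SET r.source_system = $source_system,
              r.transform_rule = $transform_rule,
              r.extracted_at = $extracted_at
"""

event = lineage_event(
    "data-Z", "app-A", "ServiceNow", "normalizeTags()",
    extracted_at=datetime(2025, 11, 8, 9, 30, tzinfo=timezone.utc),
)
print(json.dumps(event, indent=2))

# The statement and its parameters would then be handed to a Neo4j session,
# for example: session.run(DERIVED_FROM_MERGE, **event)  (not executed here)
```

Using ON CREATE SET rather than a plain SET is a design choice that keeps the write consistent with the append-only principle: re-running the load does not overwrite evidence already recorded on an edge.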


6 Provenance Validation Rules

  1. Completeness Rule: Every node must have a source_system.
  2. Timestamp Rule: extracted_at must be no more than 7 days old.
  3. Transform Disclosure Rule: All derived data must record transform_rule.
  4. Ownership Rule: Each object requires an owner.
  5. Verification Rule: If confidence < 0.8, trigger a manual validation task.

Violations automatically raise DQ or GRC tickets.
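The five rules can be sketched as a single check over a node's properties. This is an illustrative version only: the `derived` flag, the rule labels, and the decision to treat a missing timestamp or missing confidence as a violation are assumptions, and ticket creation is left out.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=7)   # Timestamp Rule threshold
MIN_CONFIDENCE = 0.8          # Verification Rule threshold


def validate(node, now):
    """Return the names of the provenance rules that `node` violates."""
    violations = []
    if not node.get("source_system"):
        violations.append("completeness")
    extracted_at = node.get("extracted_at")
    # Assumption: a missing timestamp counts as a violation too.
    if extracted_at is None or now - extracted_at > MAX_AGE:
        violations.append("timestamp")
    # `derived` is a hypothetical flag marking data produced by a transformation.
    if node.get("derived") and not node.get("transform_rule"):
        violations.append("transform_disclosure")
    if not node.get("owner"):
        violations.append("ownership")
    # Assumption: a node without a confidence score is treated as 0.0.
    if node.get("confidence", 0.0) < MIN_CONFIDENCE:
        violations.append("verification")
    return violations


now = datetime(2025, 11, 10, tzinfo=timezone.utc)

good = {
    "source_system": "ServiceNow",
    "extracted_at": datetime(2025, 11, 8, 9, 30, tzinfo=timezone.utc),
    "derived": True,
    "transform_rule": "mapAppID()",
    "owner": "App Owner",
    "confidence": 0.94,
}
bad = {
    "source_system": "",
    "extracted_at": datetime(2025, 10, 1, tzinfo=timezone.utc),
    "derived": True,
    "owner": None,
    "confidence": 0.5,
}

print(validate(good, now))  # prints []
print(validate(bad, now))
```

Passing `now` in as an argument keeps the Timestamp Rule deterministic, which makes the check easy to replay during an audit.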


7 Lineage Visualization

  • Horizontal Flow: Source → Transformation → Storage → Consumption
  • Vertical Flow: Strategic Capability → Application → Data → Outcome
  • Color Coding: green = verified, yellow = manual step, red = missing link
  • Graph View: hovering over a node shows its provenance attributes.

These views help architects answer: “What changed between version 2 and 3 of this data?”


8 Lineage Quality Metrics

| Metric | Definition | Target |
| --- | --- | --- |
| Lineage Completeness % | Nodes with valid source_system | ≥ 95 % |
| Transformation Transparency % | Processes with logged transform_rule | ≥ 90 % |
| Verification Coverage % | Nodes with verified_by populated | ≥ 80 % |
| Average Confidence Score | Mean trust value | ≥ 0.9 |

Low scores automatically surface in DQ dashboards and drive stewardship actions.
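To make the metric definitions concrete, here is a small sketch that computes all four from lists of node and process properties and flags the ones below target. The dictionary keys, the sample data, and the treatment of "valid source_system" as simply "non-empty" are assumptions for illustration.

```python
def pct(items, predicate):
    """Percentage of items satisfying predicate (0.0 for an empty list)."""
    if not items:
        return 0.0
    return 100.0 * sum(1 for item in items if predicate(item)) / len(items)


def lineage_metrics(nodes, processes):
    """Compute the four lineage quality metrics."""
    return {
        "lineage_completeness_pct": pct(nodes, lambda n: bool(n.get("source_system"))),
        "transformation_transparency_pct": pct(processes, lambda p: bool(p.get("transform_rule"))),
        "verification_coverage_pct": pct(nodes, lambda n: bool(n.get("verified_by"))),
        "average_confidence": sum(n["confidence"] for n in nodes) / len(nodes) if nodes else 0.0,
    }


# Targets taken from the metrics table.
TARGETS = {
    "lineage_completeness_pct": 95.0,
    "transformation_transparency_pct": 90.0,
    "verification_coverage_pct": 80.0,
    "average_confidence": 0.9,
}


def below_target(metrics):
    """Names of metrics that miss their target, sorted alphabetically."""
    return sorted(name for name, value in metrics.items() if value < TARGETS[name])


# Hypothetical sample data
nodes = [
    {"source_system": "ServiceNow", "verified_by": "DQ Steward", "confidence": 1.0},
    {"source_system": "ServiceNow", "verified_by": "DQ Steward", "confidence": 1.0},
    {"source_system": "Azure Monitor", "verified_by": "DQ Steward", "confidence": 0.75},
    {"source_system": "Azure Monitor", "verified_by": None, "confidence": 0.5},
]
processes = [{"transform_rule": "mapAppID()"}, {"transform_rule": None}]

metrics = lineage_metrics(nodes, processes)
print(metrics)
print(below_target(metrics))
```

With this sample, completeness is 100 %, while transparency (50 %), verification coverage (75 %), and average confidence (0.8125) all miss their targets.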


9 Governance Integration

  • ServiceNow GRC tasks are auto-generated for lineage breaches.
  • Stewards verify lineage via Teams forms linked to the graph.
  • Once verified, the confidence score is updated and the task is closed.
  • A quarterly audit compares lineage depth against schema growth.

10 Security and Privacy

  • Hash PII before storing lineage references.
  • Restrict edge visibility by role in Neo4j RBAC.
  • Encrypt lineage_path values in transit and at rest.
  • Use Microsoft Purview (formerly Azure Purview) or Microsoft Fabric as a federated catalog for cross-platform lineage.
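
One way to hash PII before it enters a lineage reference is sketched below. Note that this uses a keyed hash (HMAC-SHA-256) rather than a plain hash, which is a substitution on our part: unkeyed hashes of low-entropy values such as email addresses can be reversed with a dictionary attack. The function name, the normalization step, and the demo key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest of a PII value as a hex string."""
    # Normalize so that trivially different spellings map to the same reference.
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for demonstration; in practice it would come from a secret store.
KEY = b"demo-key-do-not-use"

ref = pseudonymize("Jane.Doe@example.com", KEY)
same = pseudonymize("  jane.doe@example.com ", KEY)

# Both spellings yield the same reference, so lineage joins still work.
print(ref == same)  # prints True
```

Because the digest is deterministic for a given key, the same person still resolves to the same lineage reference across systems without the raw value being stored.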

11 Benefits

✅ Transparency builds trust in AI recommendations.
✅ Auditors can verify every KPI back to its source table.
✅ Architects see dependencies that predict impact.
✅ Data Stewards gain ownership clarity.


12 Common Challenges & Mitigations

| Challenge | Impact | Mitigation |
| --- | --- | --- |
| Legacy systems without metadata exports | Lineage gaps | Use proxy capture via ETL or API logs |
| Rapid schema changes | Broken links | Schema versioning + drift alerts |
| Manual data uploads | Untracked sources | Governed drop folders + mandatory form metadata |
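
The "schema versioning + drift alerts" mitigation could be approximated by comparing two snapshots of a source object's schema, as in the sketch below. The snapshot format (`{column: type}`), the column names, and the alert itself (a print statement) are assumptions for illustration.

```python
def schema_drift(previous: dict, current: dict) -> dict:
    """Compare two {column: type} snapshots of the same source object."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    retyped = sorted(
        column for column in set(previous) & set(current)
        if previous[column] != current[column]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schema versions of one source table
v2 = {"sys_id": "string", "name": "string", "cost": "integer"}
v3 = {"sys_id": "string", "name": "string", "cost": "decimal", "region": "string"}

drift = schema_drift(v2, v3)
if any(drift.values()):
    # A real pipeline would raise a drift alert here instead of printing.
    print("schema drift detected:", drift)
```

Storing each snapshot under a version number also gives a direct answer to questions such as what changed between two versions of a dataset.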

13 Example Query

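// One row per source system: number of lineage paths and shortest hop count
MATCH p = (d:DataEntity)-[:DERIVED_FROM*1..4]->(s:SourceData)
WHERE d.name CONTAINS 'Customer'
RETURN s.source_system AS source_system, count(p) AS paths, min(length(p)) AS min_hops;

→ Lists every system that contributes to Customer data up to 4 hops back, with the number of lineage paths and the shortest hop distance for each.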


14 Takeaway

Lineage is the truth engine of EA 2.0.
Without provenance, automation is just a guess. With it, AI and humans can trust each other’s work because every insight comes with a receipt.
