
ETL / ELT Pipeline Design


From Staging to Graph Load #


1 Purpose #

The integration patterns bring in raw data; the ETL/ELT pipeline turns it into living intelligence.
In EA 2.0, pipelines are not one-off scripts — they are self-documenting, governed workflows that continuously feed the graph with trusted metadata.

The objective is simple: extract safely, transform minimally, load intelligently.


2 Pipeline Philosophy #

  1. Metadata-first: ingest structure and lineage before content.
  2. Idempotent by design: every run can repeat without duplication.
  3. Immutable logs: nothing disappears; every load leaves evidence.
  4. Policy-aware: pipelines obey data-sensitivity and residency rules.
  5. Decoupled: extraction, staging, transformation, and load are independent layers so failure never cascades.

3 Conceptual Flow #

Extract  →  Stage  →  Normalize/Map  →  Validate  →  Load  →  Audit

Each step emits events into an integration bus for observability and retry.
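The event emission above can be sketched as a small helper. The `ListBus`, topic names, and event fields here are illustrative stand-ins for a real integration bus (Service Bus, Kafka, SNS, ...):

```python
import json
import time
import uuid

def emit_step_event(bus, step, status, load_id, detail=None):
    """Publish a pipeline step event so observers can monitor and retry.

    `bus` is any object with a publish(topic, payload) method -- a
    hypothetical stand-in for your real integration bus."""
    event = {
        "event_id": str(uuid.uuid4()),
        "load_id": load_id,
        "step": step,        # Extract | Stage | Normalize | Validate | Load | Audit
        "status": status,    # started | succeeded | failed
        "emitted_at": time.time(),
        "detail": detail or {},
    }
    bus.publish("pipeline." + step.lower(), json.dumps(event))
    return event

class ListBus:
    """In-memory bus for illustration only."""
    def __init__(self):
        self.messages = []
    def publish(self, topic, payload):
        self.messages.append((topic, payload))

bus = ListBus()
emit_step_event(bus, "Extract", "started", "EA2-2025-11-08T12:00Z")
```

Because every step publishes to its own topic, a monitor can subscribe per stage and trigger retries without coupling the layers together.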


4 Typical Technology Stack #

| Layer | Azure Example | AWS Example | Notes |
|---|---|---|---|
| Extract | Azure Functions / Data Factory Copy Activity | Lambda / Glue Crawler | Pulls delta sets via API, DB, or file watcher |
| Stage | Azure Blob / ADLS | S3 Bucket | Stores raw JSON/CSV snapshots in timestamped folders |
| Transform / Map | Data Factory Data Flow / Synapse Pipeline | Glue Job / EMR | Standardizes schema, applies taxonomy mapping |
| Load to Graph | Function → Neo4j REST / Cosmos Gremlin | Lambda → Neptune Bulk Loader | Upserts nodes/edges using the canonical ontology |
| Audit / Log | Log Analytics / App Insights | CloudWatch / S3 Logs | Central visibility of lineage and errors |

5 Staging Layer Design #

The staging zone is a compliance buffer:

  • Raw data stored verbatim with hash and timestamp.
  • Folder hierarchy: /system/yyyy/mm/dd/hh/
  • Access limited to the ingestion service principal only.
  • Retention: 30 days (configurable).

Benefits: rollback, replay, and forensic traceability.
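A minimal sketch of the staging write, assuming a local filesystem stands in for Blob/S3; the function name and sidecar-file convention are illustrative:

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def stage_raw_snapshot(root, system, payload):
    """Store the raw extract verbatim under /system/yyyy/mm/dd/hh/ with a
    SHA-256 sidecar file, enabling replay and forensic traceability."""
    now = datetime.now(timezone.utc)
    folder = root / system / now.strftime("%Y/%m/%d/%H")
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / now.strftime("%Y%m%dT%H%M%SZ.json")
    path.write_bytes(payload)  # verbatim, no transformation in staging
    path.with_suffix(".sha256").write_text(hashlib.sha256(payload).hexdigest())
    return path

root = Path(tempfile.mkdtemp())
snapshot = stage_raw_snapshot(root, "servicenow", b'{"apps": []}')
```

The hash sidecar lets a replay verify that the staged bytes are exactly what was extracted before re-loading them.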

6 Normalization & Mapping #

Transformation logic converts heterogeneous schemas into the EA 2.0 canonical model.

| Step | Function | Example |
|---|---|---|
| Field Mapping | Align source to target labels | app_name → name, owner_id → person_ref |
| Type Casting | Normalize datatypes | string → datetime |
| Lookup Enrichment | Add capability or cost category | join on reference tables |
| Label Derivation | Determine sensitivity | if pii=true → label='Confidential' |
| Relationship Inversion | Create edges | Application → Capability |

Mappings are stored in JSON config (mapping.json) so new systems onboard via configuration, not code.
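A configuration-driven mapping step might look like the sketch below; the `mapping.json` schema shown (a `fields` rename map plus `label_rules`) is a hypothetical example, not a prescribed format:

```python
import json

# Hypothetical mapping.json content: field renames plus a sensitivity rule.
MAPPING_JSON = """
{
  "fields": {"app_name": "name", "owner_id": "person_ref"},
  "label_rules": [{"when": {"pii": true}, "label": "Confidential"}]
}
"""

def apply_mapping(record, mapping):
    """Rename source fields to canonical labels, then derive sensitivity
    labels from declarative rules -- no per-system code required."""
    out = {mapping["fields"].get(k, k): v for k, v in record.items()}
    for rule in mapping.get("label_rules", []):
        if all(out.get(k) == v for k, v in rule["when"].items()):
            out["label"] = rule["label"]
    return out

mapping = json.loads(MAPPING_JSON)
canonical = apply_mapping({"app_name": "CRM", "owner_id": "u42", "pii": True}, mapping)
# canonical carries name, person_ref, and the derived 'Confidential' label
```

Onboarding a new source system then means adding an entry to the config, not writing a new transform.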


7 Validation & Quality Gates #

Every dataset passes automated gates before graph load:

| Gate | Check | Threshold |
|---|---|---|
| Completeness | Mandatory fields populated | ≥ 95 % |
| Freshness | last_seen_at within SLA | ≤ 30 days |
| Uniqueness | No duplicate keys | 0 duplicates |
| Referential Integrity | All parent IDs exist | 100 % |
| Sensitivity Consistency | Label matches pattern | 100 % |

Failed records go to a quarantine container for steward review.
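The gate-then-quarantine split can be sketched as below; this simplified example covers only the completeness and uniqueness gates, and the field names are assumptions:

```python
def run_quality_gates(records, mandatory=("name", "natural_key")):
    """Split a batch into loadable vs. quarantined records.  Only the
    completeness and uniqueness gates are shown; freshness and
    referential-integrity checks would follow the same pattern."""
    passed, quarantine, seen = [], [], set()
    for rec in records:
        if any(not rec.get(field) for field in mandatory):
            quarantine.append((rec, "completeness"))
        elif rec["natural_key"] in seen:
            quarantine.append((rec, "uniqueness"))
        else:
            seen.add(rec["natural_key"])
            passed.append(rec)
    return passed, quarantine

batch = [
    {"name": "CRM", "natural_key": "app-1"},
    {"name": "", "natural_key": "app-2"},        # fails completeness
    {"name": "CRM v2", "natural_key": "app-1"},  # duplicate key
]
ok, quarantined = run_quality_gates(batch)
```

Each quarantined record carries the name of the gate it failed, so stewards see why it was held back, not just that it was.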


8 Load to Graph #

The Graph Loader performs bulk upserts via REST or Cypher/Gremlin scripts.

Pseudo-logic:

for record in dataset:
    node_type = record["type"]
    node_key = record["natural_key"]   # natural key makes the upsert idempotent
    merge_node(node_type, node_key, record["properties"])
    for rel in record["relations"]:
        merge_edge(rel["from"], rel["to"], rel["type"])

Features:

  • Handles partial updates (upsert only changed fields)
  • Automatically stamps source_system, load_id, timestamp
  • Uses transaction batches (size = 1000) for speed and rollback safety

Average throughput: > 500 records/sec on modest compute.
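A runnable sketch of the `merge_node` upsert, using a plain dict as a stand-in for the graph store (a real loader would issue Cypher MERGE or a Gremlin coalesce step instead):

```python
import time

def merge_node(graph, node_type, key, props, load_id, source_system):
    """Idempotent upsert: create the node if missing, otherwise update only
    the supplied fields, and stamp provenance on every write."""
    node = graph.setdefault((node_type, key), {})
    node.update(props)  # partial update: only changed fields overwritten
    node.update({
        "source_system": source_system,
        "load_id": load_id,
        "timestamp": time.time(),
    })
    return node

graph = {}
merge_node(graph, "Application", "app-1", {"name": "CRM"}, "EA2-001", "servicenow")
merge_node(graph, "Application", "app-1", {"owner": "u42"}, "EA2-002", "servicenow")
# two runs, one node: the second load adds 'owner' without duplicating 'app-1'
```

Running the loader twice over the same dataset leaves the graph unchanged apart from fresher provenance stamps, which is exactly the idempotency the pipeline philosophy demands.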


9 Audit & Lineage #

Each pipeline run emits a Load Manifest:

{
  "load_id": "EA2-2025-11-08T12:00Z",
  "source_system": "servicenow",
  "records_loaded": 4213,
  "records_failed": 12,
  "checksum": "b21f4e...",
  "duration_sec": 45,
  "operator": "function_app",
  "status": "success"
}

These manifests populate the Lineage Dashboard used by governance and AI-trust layers.


10 Error Handling & Replay #

  • Soft errors (validation failures) → quarantined for steward correction.
  • Hard errors (system faults) → retried automatically with exponential back-off.
  • Replays: re-run the pipeline with a load_id to restore missing records.
  • Alerts: pushed to Teams/ServiceNow when failures exceed the threshold.

No manual intervention required until human judgment is actually needed.
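The hard-error path above can be sketched as a generic back-off wrapper; the function name and delay schedule (1 s, 2 s, 4 s, ...) are illustrative:

```python
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry hard (system) errors with exponential back-off.  Soft
    validation errors should be quarantined instead, never retried."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the failure for alerting
            sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky_load():
    """Simulated transient system fault that clears on the third attempt."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

delays = []
result = retry_with_backoff(flaky_load, sleep=delays.append)
# succeeds on attempt 3 after back-offs of 1.0s and 2.0s
```

Injecting the `sleep` function keeps the wrapper testable; in production it defaults to real waiting.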


11 Security Controls #

  • All pipelines run under managed identity with least privilege.
  • Data encrypted at rest (customer-managed keys).
  • Sensitive domains tokenized in staging; clear text restored only in secure graph zone.
  • Audit logs immutable (append-only).
  • Deployment via DevSecOps pipeline with approval gates.

12 Performance Optimisation #

| Technique | Benefit |
|---|---|
| Parallel Functions per domain | Linearly scales throughput |
| Delta-window filter (updated_since) | Reduces payload size |
| Stream compression (gzip JSON) | Saves bandwidth |
| Pre-aggregate counts | Quick validation before full load |
| Async graph commits | Keeps API latency < 300 ms |

Result: daily end-to-end refresh of thousands of records in minutes.
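The delta-window technique from the table can be sketched as a simple filter; in practice the `updated_since` value would be passed to the source API rather than applied client-side, and the field names here are assumptions:

```python
from datetime import datetime, timedelta, timezone

def delta_window(records, last_success):
    """Keep only rows changed since the last successful run, so each load
    moves a delta set instead of the full dataset."""
    return [r for r in records if r["updated_at"] > last_success]

now = datetime.now(timezone.utc)
records = [
    {"id": "a", "updated_at": now - timedelta(hours=2)},   # changed recently
    {"id": "b", "updated_at": now - timedelta(days=3)},    # unchanged since last run
]
fresh = delta_window(records, last_success=now - timedelta(days=1))
# only record "a" survives the filter
```

The `last_success` timestamp is best read from the previous run's Load Manifest, so a failed run automatically widens the next window.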


13 KPIs & Monitoring #

| KPI | Description | Target |
|---|---|---|
| Load Success Rate | % of records loaded without error | > 99 % |
| Average Latency | Extraction → Graph completion | < 15 min |
| Quarantine Rate | % of records flagged | < 5 % |
| Freshness Compliance | % within SLA | > 95 % |
| Rollback Success | % of replays restored cleanly | 100 % |

Dashboards in Power BI or Grafana visualize pipeline health per source.


14 Implementation Example (Azure) #

[Azure Function Extract] 
   ↓ 
[Blob Stage] 
   ↓ 
[Data Factory Mapping Flow] 
   ↓ 
[Azure Function Graph Loader] 
   ↓ 
[Cosmos DB Gremlin API] 
   ↓ 
[Application Insights + ServiceNow Alerts]

All orchestrated under one Resource Group; managed identity flows end-to-end.


15 Takeaway #

The ETL/ELT pipeline is the circulatory system of EA 2.0.
It must be governed like code, monitored like infrastructure, and trusted like audit data.
When built this way, your graph doesn’t age — it refreshes itself.
