
ETL / ELT Pipeline Design


From Staging to Graph Load #


1 Purpose #

The integration patterns bring in raw data; the ETL/ELT pipeline turns it into living intelligence.
In EA 2.0, pipelines are not one-off scripts — they are self-documenting, governed workflows that continuously feed the graph with trusted metadata.

The objective is simple: extract safely, transform minimally, load intelligently.


2 Pipeline Philosophy #

  1. Metadata-first: ingest structure and lineage before content.
  2. Idempotent by design: every run can repeat without duplication.
  3. Immutable logs: nothing disappears; every load leaves evidence.
  4. Policy-aware: pipelines obey data-sensitivity and residency rules.
  5. Decoupled: extraction, staging, transformation, and load are independent layers so failure never cascades.

3 Conceptual Flow #

Extract  →  Stage  →  Normalize/Map  →  Validate  →  Load  →  Audit

Each step emits events into an integration bus for observability and retry.
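The event emission above can be sketched as a small helper. The `ListBus`, topic names, and event fields here are illustrative stand-ins for a real integration bus (Service Bus, Kafka, SNS, ...):

```python
import json
import time
import uuid

def emit_step_event(bus, step, status, load_id, detail=None):
    """Publish a pipeline step event so observers can monitor and retry.

    `bus` is any object with a publish(topic, payload) method -- a
    hypothetical stand-in for your real integration bus."""
    event = {
        "event_id": str(uuid.uuid4()),
        "load_id": load_id,
        "step": step,        # Extract | Stage | Normalize | Validate | Load | Audit
        "status": status,    # started | succeeded | failed
        "emitted_at": time.time(),
        "detail": detail or {},
    }
    bus.publish("pipeline." + step.lower(), json.dumps(event))
    return event

class ListBus:
    """In-memory bus for illustration only."""
    def __init__(self):
        self.messages = []
    def publish(self, topic, payload):
        self.messages.append((topic, payload))

bus = ListBus()
emit_step_event(bus, "Extract", "started", "EA2-2025-11-08T12:00Z")
```

Because every step publishes to its own topic, a monitor can subscribe per stage and trigger retries without coupling the layers together.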


4 Typical Technology Stack #

| Layer | Azure Example | AWS Example | Notes |
|---|---|---|---|
| Extract | Azure Functions / Data Factory Copy Activity | Lambda / Glue Crawler | Pulls delta sets via API, DB, or file watcher |
| Stage | Azure Blob / ADLS | S3 Bucket | Stores raw JSON/CSV snapshots in timestamped folders |
| Transform / Map | Data Factory Data Flow / Synapse Pipeline | Glue Job / EMR | Standardizes schema, applies taxonomy mapping |
| Load to Graph | Function → Neo4j REST / Cosmos Gremlin | Lambda → Neptune Bulk Loader | Upserts nodes/edges using the canonical ontology |
| Audit / Log | Log Analytics / App Insights | CloudWatch / S3 Logs | Central visibility of lineage and errors |

5 Staging Layer Design #

The staging zone is a compliance buffer:

  • Raw data stored verbatim with hash and timestamp.
  • Folder hierarchy: /system/yyyy/mm/dd/hh/
  • Access limited to the ingestion service principal only.
  • Retention: 30 days (configurable).

Benefits: rollback, replay, and forensic traceability.
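A minimal sketch of the staging write, assuming a local filesystem stands in for Blob/S3; the function name and sidecar-file convention are illustrative:

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def stage_raw_snapshot(root, system, payload):
    """Store the raw extract verbatim under /system/yyyy/mm/dd/hh/ with a
    SHA-256 sidecar file, enabling replay and forensic traceability."""
    now = datetime.now(timezone.utc)
    folder = root / system / now.strftime("%Y/%m/%d/%H")
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / now.strftime("%Y%m%dT%H%M%SZ.json")
    path.write_bytes(payload)  # verbatim, no transformation in staging
    path.with_suffix(".sha256").write_text(hashlib.sha256(payload).hexdigest())
    return path

root = Path(tempfile.mkdtemp())
snapshot = stage_raw_snapshot(root, "servicenow", b'{"apps": []}')
```

The hash sidecar lets a replay verify that the staged bytes are exactly what was extracted before re-loading them.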

6 Normalization & Mapping #

Transformation logic converts heterogeneous schemas into the EA 2.0 canonical model.

| Step | Function | Example |
|---|---|---|
| Field Mapping | Align source to target labels | app_name → name, owner_id → person_ref |
| Type Casting | Normalize datatypes | string → datetime |
| Lookup Enrichment | Add capability or cost category | join on reference tables |
| Label Derivation | Determine sensitivity | if pii=true → label='Confidential' |
| Relationship Inversion | Create edges | Application → Capability |

Mappings are stored in JSON config (mapping.json) so new systems onboard via configuration, not code.
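A configuration-driven mapping step might look like the sketch below; the `mapping.json` schema shown (a `fields` rename map plus `label_rules`) is a hypothetical example, not a prescribed format:

```python
import json

# Hypothetical mapping.json content: field renames plus a sensitivity rule.
MAPPING_JSON = """
{
  "fields": {"app_name": "name", "owner_id": "person_ref"},
  "label_rules": [{"when": {"pii": true}, "label": "Confidential"}]
}
"""

def apply_mapping(record, mapping):
    """Rename source fields to canonical labels, then derive sensitivity
    labels from declarative rules -- no per-system code required."""
    out = {mapping["fields"].get(k, k): v for k, v in record.items()}
    for rule in mapping.get("label_rules", []):
        if all(out.get(k) == v for k, v in rule["when"].items()):
            out["label"] = rule["label"]
    return out

mapping = json.loads(MAPPING_JSON)
canonical = apply_mapping({"app_name": "CRM", "owner_id": "u42", "pii": True}, mapping)
# canonical carries name, person_ref, and the derived 'Confidential' label
```

Onboarding a new source system then means adding an entry to the config, not writing a new transform.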


7 Validation & Quality Gates #

Every dataset passes automated gates before graph load:

| Gate | Check | Threshold |
|---|---|---|
| Completeness | Mandatory fields populated | ≥ 95 % |
| Freshness | last_seen_at within SLA | ≤ 30 days |
| Uniqueness | No duplicate keys | 0 duplicates |
| Referential Integrity | All parent IDs exist | 100 % |
| Sensitivity Consistency | Label matches pattern | 100 % |

Failed records go to a quarantine container for steward review.
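The gate-then-quarantine split can be sketched as below; this simplified example covers only the completeness and uniqueness gates, and the field names are assumptions:

```python
def run_quality_gates(records, mandatory=("name", "natural_key")):
    """Split a batch into loadable vs. quarantined records.  Only the
    completeness and uniqueness gates are shown; freshness and
    referential-integrity checks would follow the same pattern."""
    passed, quarantine, seen = [], [], set()
    for rec in records:
        if any(not rec.get(field) for field in mandatory):
            quarantine.append((rec, "completeness"))
        elif rec["natural_key"] in seen:
            quarantine.append((rec, "uniqueness"))
        else:
            seen.add(rec["natural_key"])
            passed.append(rec)
    return passed, quarantine

batch = [
    {"name": "CRM", "natural_key": "app-1"},
    {"name": "", "natural_key": "app-2"},        # fails completeness
    {"name": "CRM v2", "natural_key": "app-1"},  # duplicate key
]
ok, quarantined = run_quality_gates(batch)
```

Each quarantined record carries the name of the gate it failed, so stewards see why it was held back, not just that it was.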


8 Load to Graph #

The Graph Loader performs bulk upserts via REST or Cypher/Gremlin scripts.

Pseudo-logic:

for record in dataset:
    node_type = record["type"]
    node_key = record["natural_key"]   # natural key makes the upsert idempotent
    merge_node(node_type, node_key, record["properties"])
    for rel in record["relations"]:
        merge_edge(rel["from"], rel["to"], rel["type"])

Features:

  • Handles partial updates (upsert only changed fields)
  • Automatically stamps source_system, load_id, timestamp
  • Uses transaction batches (size = 1000) for speed and rollback safety

Average throughput: > 500 records/sec on modest compute.
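A runnable sketch of the `merge_node` upsert, using a plain dict as a stand-in for the graph store (a real loader would issue Cypher MERGE or a Gremlin coalesce step instead):

```python
import time

def merge_node(graph, node_type, key, props, load_id, source_system):
    """Idempotent upsert: create the node if missing, otherwise update only
    the supplied fields, and stamp provenance on every write."""
    node = graph.setdefault((node_type, key), {})
    node.update(props)  # partial update: only changed fields overwritten
    node.update({
        "source_system": source_system,
        "load_id": load_id,
        "timestamp": time.time(),
    })
    return node

graph = {}
merge_node(graph, "Application", "app-1", {"name": "CRM"}, "EA2-001", "servicenow")
merge_node(graph, "Application", "app-1", {"owner": "u42"}, "EA2-002", "servicenow")
# two runs, one node: the second load adds 'owner' without duplicating 'app-1'
```

Running the loader twice over the same dataset leaves the graph unchanged apart from fresher provenance stamps, which is exactly the idempotency the pipeline philosophy demands.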


9 Audit & Lineage #

Each pipeline run emits a Load Manifest:

{
  "load_id": "EA2-2025-11-08T12:00Z",
  "source_system": "servicenow",
  "records_loaded": 4213,
  "records_failed": 12,
  "checksum": "b21f4e...",
  "duration_sec": 45,
  "operator": "function_app",
  "status": "success"
}

These manifests populate the Lineage Dashboard used by governance and AI-trust layers.


10 Error Handling & Replay #

  • Soft errors (validation failures) → quarantined for steward correction.
  • Hard errors (system faults) → retried automatically with exponential back-off.
  • Replays: re-run the pipeline with a load_id to restore missing records.
  • Alerts: pushed to Teams/ServiceNow when failures exceed the threshold.

No manual intervention required until human judgment is actually needed.
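The hard-error path above can be sketched as a generic back-off wrapper; the function name and delay schedule (1 s, 2 s, 4 s, ...) are illustrative:

```python
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry hard (system) errors with exponential back-off.  Soft
    validation errors should be quarantined instead, never retried."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the failure for alerting
            sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky_load():
    """Simulated transient system fault that clears on the third attempt."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

delays = []
result = retry_with_backoff(flaky_load, sleep=delays.append)
# succeeds on attempt 3 after back-offs of 1.0s and 2.0s
```

Injecting the `sleep` function keeps the wrapper testable; in production it defaults to real waiting.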


11 Security Controls #

  • All pipelines run under managed identity with least privilege.
  • Data encrypted at rest (customer-managed keys).
  • Sensitive domains tokenized in staging; clear text restored only in secure graph zone.
  • Audit logs immutable (append-only).
  • Deployment via DevSecOps pipeline with approval gates.

12 Performance Optimisation #

| Technique | Benefit |
|---|---|
| Parallel Functions per domain | Linearly scales throughput |
| Delta-window filter (updated_since) | Reduces payload size |
| Stream compression (gzip JSON) | Saves bandwidth |
| Pre-aggregate counts | Quick validation before full load |
| Async graph commits | Keeps API latency < 300 ms |

Result: daily end-to-end refresh of thousands of records in minutes.
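The delta-window technique from the table can be sketched as a simple filter; in practice the `updated_since` value would be passed to the source API rather than applied client-side, and the field names here are assumptions:

```python
from datetime import datetime, timedelta, timezone

def delta_window(records, last_success):
    """Keep only rows changed since the last successful run, so each load
    moves a delta set instead of the full dataset."""
    return [r for r in records if r["updated_at"] > last_success]

now = datetime.now(timezone.utc)
records = [
    {"id": "a", "updated_at": now - timedelta(hours=2)},   # changed recently
    {"id": "b", "updated_at": now - timedelta(days=3)},    # unchanged since last run
]
fresh = delta_window(records, last_success=now - timedelta(days=1))
# only record "a" survives the filter
```

The `last_success` timestamp is best read from the previous run's Load Manifest, so a failed run automatically widens the next window.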


13 KPIs & Monitoring #

| KPI | Description | Target |
|---|---|---|
| Load Success Rate | % of records loaded without error | > 99 % |
| Average Latency | Extraction → Graph completion | < 15 min |
| Quarantine Rate | % of records flagged | < 5 % |
| Freshness Compliance | % within SLA | > 95 % |
| Rollback Success | % of replays restored cleanly | 100 % |

Dashboards in Power BI or Grafana visualize pipeline health per source.


14 Implementation Example (Azure) #

[Azure Function Extract] 
   ↓ 
[Blob Stage] 
   ↓ 
[Data Factory Mapping Flow] 
   ↓ 
[Azure Function Graph Loader] 
   ↓ 
[Cosmos DB Gremlin API] 
   ↓ 
[Application Insights + ServiceNow Alerts]

All orchestrated under one Resource Group; managed identity flows end-to-end.


15 Takeaway #

The ETL/ELT pipeline is the circulatory system of EA 2.0.
It must be governed like code, monitored like infrastructure, and trusted like audit data.
When built this way, your graph doesn’t age — it refreshes itself.
