- From Staging to Graph Load
- 1 Purpose
- 2 Pipeline Philosophy
- 3 Conceptual Flow
- 4 Typical Technology Stack
- 5 Staging Layer Design
- 6 Normalization & Mapping
- 7 Validation & Quality Gates
- 8 Load to Graph
- 9 Audit & Lineage
- 10 Error Handling & Replay
- 11 Security Controls
- 12 Performance Optimisation
- 13 KPIs & Monitoring
- 14 Implementation Example (Azure)
- 15 Takeaway
From Staging to Graph Load #
1 Purpose #
The integration patterns bring raw data; the ETL/ELT pipeline turns it into living intelligence.
In EA 2.0, pipelines are not one-off scripts — they are self-documenting, governed workflows that continuously feed the graph with trusted metadata.
The objective is simple: extract safely, transform minimally, load intelligently.
2 Pipeline Philosophy #
- Metadata-first: ingest structure and lineage before content.
- Idempotent by design: every run can repeat without duplication.
- Immutable logs: nothing disappears; every load leaves evidence.
- Policy-aware: pipelines obey data-sensitivity and residency rules.
- Decoupled: extraction, staging, transformation, and load are independent layers so failure never cascades.
3 Conceptual Flow #
Extract → Stage → Normalize/Map → Validate → Load → Audit
Each step emits events into an integration bus for observability and retry.
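The per-step eventing can be sketched as follows. This is a minimal illustration: `StepEvent` and its field names are assumptions, and a plain list stands in for the real integration bus (Service Bus, SNS, etc.).

```python
import json
import time
from dataclasses import dataclass, asdict

# Hypothetical event shape; field names are illustrative, not a fixed schema.
@dataclass
class StepEvent:
    load_id: str
    step: str        # Extract, Stage, Normalize, Validate, Load, Audit
    status: str      # "started" | "succeeded" | "failed"
    emitted_at: float

def emit(bus: list, event: StepEvent) -> None:
    """Publish to the integration bus (a list stands in for the real broker)."""
    bus.append(json.dumps(asdict(event)))

bus: list = []
for step in ["Extract", "Stage", "Normalize", "Validate", "Load", "Audit"]:
    emit(bus, StepEvent("EA2-demo", step, "succeeded", time.time()))
```

Because each step emits its own event, a monitoring consumer can detect a stalled or failed stage without inspecting the pipeline's internals.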
4 Typical Technology Stack #
| Layer | Azure Example | AWS Example | Notes |
|---|---|---|---|
| Extract | Azure Functions / Data Factory Copy Activity | Lambda / Glue Crawler | Pulls delta sets via API, DB, or file watcher |
| Stage | Azure Blob / ADLS | S3 Bucket | Stores raw JSON/CSV snapshots with timestamped folders |
| Transform / Map | Data Factory Data Flow / Synapse Pipeline | Glue Job / EMR | Standardizes schema, applies taxonomy mapping |
| Load to Graph | Function → Neo4j REST / Cosmos Gremlin | Lambda → Neptune Bulk Loader | Upserts nodes/edges using canonical ontology |
| Audit / Log | Log Analytics / App Insights | CloudWatch / S3 Logs | Central visibility of lineage and errors |
5 Staging Layer Design #
The staging zone is a compliance buffer:
- Raw data stored verbatim with hash and timestamp.
- Folder hierarchy: /system/yyyy/mm/dd/hh/
- Access limited to the ingestion service principal only.
- Retention: 30 days (configurable).
Benefits: rollback, replay, and forensic traceability.
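A staging write along these lines can be sketched in a few lines of Python. The function name and the `.meta.json` sidecar convention are assumptions; the folder hierarchy and hash-plus-timestamp rule come from the list above.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def stage_snapshot(root: Path, system: str, payload: bytes) -> Path:
    """Store a raw payload verbatim under /system/yyyy/mm/dd/hh/."""
    now = datetime.now(timezone.utc)
    folder = root / system / now.strftime("%Y/%m/%d/%H")
    folder.mkdir(parents=True, exist_ok=True)
    data_file = folder / f"snapshot-{now.strftime('%Y%m%dT%H%M%SZ')}.json"
    data_file.write_bytes(payload)  # verbatim, no transformation in staging
    # Sidecar records the hash and timestamp for forensic traceability.
    digest = hashlib.sha256(payload).hexdigest()
    data_file.with_suffix(".meta.json").write_text(
        json.dumps({"sha256": digest, "staged_at": now.isoformat()})
    )
    return data_file
```

Replay and rollback then reduce to re-reading a dated folder and verifying its hashes against the sidecars.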
6 Normalization & Mapping #
Transformation logic converts heterogeneous schemas into the EA 2.0 canonical model.
| Step | Function | Example |
|---|---|---|
| Field Mapping | Align source to target labels | app_name → name, owner_id → person_ref |
| Type Casting | Normalize datatypes | string → datetime |
| Lookup Enrichment | Add capability or cost category | join on reference tables |
| Label Derivation | Determine sensitivity | if pii=true → label='Confidential' |
| Relationship Inversion | Create edges | Application → Capability |
Mappings are stored in JSON config (mapping.json) so new systems onboard via configuration, not code.
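A configuration-driven mapping might look like the sketch below. The exact shape of mapping.json is an assumption; the field renames and the pii-to-label rule are taken from the table above.

```python
import json

# Illustrative mapping.json content; the keys and nesting are an assumed
# schema, not a documented format.
MAPPING = json.loads("""
{
  "fields": {"app_name": "name", "owner_id": "person_ref"},
  "labels": {"pii": {"true": "Confidential", "false": "Internal"}}
}
""")

def apply_mapping(record: dict, mapping: dict) -> dict:
    """Rename source fields to canonical names and derive the sensitivity label."""
    out = {mapping["fields"].get(k, k): v for k, v in record.items()}
    pii = str(record.get("pii", "false")).lower()
    out["label"] = mapping["labels"]["pii"][pii]
    return out

mapped = apply_mapping({"app_name": "CRM", "owner_id": "P42", "pii": "true"}, MAPPING)
```

Onboarding a new source system then means adding a mapping entry, not writing new transformation code.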
7 Validation & Quality Gates #
Every dataset passes automated gates before graph load:
| Gate | Check | Threshold |
|---|---|---|
| Completeness | Mandatory fields populated | ≥ 95 % |
| Freshness | last_seen_at within SLA | ≤ 30 days |
| Uniqueness | No duplicate keys | 0 duplicates |
| Referential Integrity | All parent IDs exist | 100 % |
| Sensitivity Consistency | Label matches pattern | 100 % |
Failed records go to a quarantine container for steward review.
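A record-level sketch of three of these gates (completeness, freshness, uniqueness) is shown below. Field names such as `natural_key` and `last_seen_at` follow the rest of this chapter; the percentage thresholds in the table are dataset-level and would be checked over the returned lists.

```python
from datetime import datetime, timedelta, timezone

MANDATORY = ("natural_key", "type", "last_seen_at")

def run_gates(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (loadable, quarantined)."""
    seen: set = set()
    loadable, quarantine = [], []
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)  # freshness SLA
    for r in records:
        complete = all(r.get(f) for f in MANDATORY)
        fresh = complete and datetime.fromisoformat(r["last_seen_at"]) >= cutoff
        unique = r.get("natural_key") not in seen
        if complete and fresh and unique:
            seen.add(r["natural_key"])
            loadable.append(r)
        else:
            quarantine.append(r)  # steward review, never silent drop
    return loadable, quarantine
```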
8 Load to Graph #
The Graph Loader performs bulk upserts via REST or Cypher/Gremlin scripts.
Pseudo-logic:
for record in dataset:
    node_type = record["type"]
    natural_key = record["natural_key"]
    merge_node(node_type, natural_key, record["properties"])
    for rel in record["relations"]:
        merge_edge(rel["from"], rel["to"], rel["type"])
Features:
- Handles partial updates (upserts only changed fields)
- Automatically stamps source_system, load_id, and timestamp on every record
- Uses transaction batches (size = 1000) for speed and rollback safety
Average throughput: > 500 records/sec on modest compute.
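The batched upsert can be sketched as follows. The `UNWIND`/`MERGE` statement is illustrative Cypher, and `run` stands in for a real driver session (for example `neo4j.Session.run`); the batch size of 1000 and the `source_system`/`load_id` stamps come from the feature list above.

```python
BATCH_SIZE = 1000

# Illustrative Cypher; a production statement would also set a node label.
MERGE_NODE = (
    "UNWIND $rows AS row "
    "MERGE (n {natural_key: row.natural_key}) "
    "SET n += row.properties, n.source_system = $source, n.load_id = $load_id"
)

def load_in_batches(rows, run, source, load_id):
    """Send rows to the graph in transaction batches of BATCH_SIZE."""
    for i in range(0, len(rows), BATCH_SIZE):
        batch = rows[i : i + BATCH_SIZE]
        run(MERGE_NODE, rows=batch, source=source, load_id=load_id)

# Stubbed run() that just records batch sizes, to show the batching behaviour.
calls = []
load_in_batches(
    [{"natural_key": str(n), "properties": {}} for n in range(2500)],
    lambda q, **p: calls.append(len(p["rows"])),
    "servicenow", "EA2-demo",
)
```

Batching keeps each transaction small enough to roll back cleanly while amortizing round-trip cost, which is what makes the quoted throughput plausible on modest compute.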
9 Audit & Lineage #
Each pipeline run emits a Load Manifest:
{
"load_id": "EA2-2025-11-08T12:00Z",
"source_system": "servicenow",
"records_loaded": 4213,
"records_failed": 12,
"checksum": "b21f4e...",
"duration_sec": 45,
"operator": "function_app",
"status": "success"
}
These manifests populate the Lineage Dashboard used by governance and AI-trust layers.
10 Error Handling & Replay #
- Soft errors (validation failures) → quarantined for steward correction.
- Hard errors (system failures) → retried automatically with exponential back-off.
- Replays: re-run the pipeline with a prior load_id to restore missing records.
- Alerts: pushed to Teams/ServiceNow when failures exceed a threshold.
No manual intervention required until human judgment is actually needed.
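The hard-error path above amounts to a standard exponential back-off loop, sketched here. The function name and attempt count are assumptions; injecting `sleep` keeps the sketch testable.

```python
import time

def retry_with_backoff(op, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry a transient (hard) failure with exponential back-off: 1s, 2s, 4s..."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # exhausted: surface the failure for alerting
            sleep(base_delay * (2 ** attempt))
```

Only after the retries are exhausted does the failure escalate to an alert, which is what keeps humans out of the loop until judgment is genuinely required.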
11 Security Controls #
- All pipelines run under managed identity with least privilege.
- Data encrypted at rest (customer-managed keys).
- Sensitive domains tokenized in staging; clear text restored only in secure graph zone.
- Audit logs immutable (append-only).
- Deployment via DevSecOps pipeline with approval gates.
12 Performance Optimisation #
| Technique | Benefit |
|---|---|
| Parallel Functions per domain | Linearly scales throughput |
| Delta-window filter (updated_since) | Reduces payload size |
| Stream compression (gzip JSON) | Saves bandwidth |
| Pre-aggregate counts | Quick validation before full load |
| Async graph commits | Keeps API latency < 300 ms |
Result: daily end-to-end refresh of thousands of records in minutes.
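As one concrete example, the delta-window filter from the table reduces to appending an `updated_since` parameter to the extract call. The endpoint and helper below are hypothetical; only the parameter name comes from the table.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def delta_url(base: str, window_hours: int = 24) -> str:
    """Build an extract URL that pulls only records updated in the last window."""
    since = (datetime.now(timezone.utc) - timedelta(hours=window_hours)).isoformat()
    return f"{base}?{urlencode({'updated_since': since})}"
```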
13 KPIs & Monitoring #
| KPI | Description | Target |
|---|---|---|
| Load Success Rate | % of records loaded without error | > 99 % |
| Average Latency | Extraction→Graph completion | < 15 min |
| Quarantine Rate | % records flagged | < 5 % |
| Freshness Compliance | % within SLA | > 95 % |
| Rollback Success | % replays restored cleanly | 100 % |
Dashboards in Power BI or Grafana visualize pipeline health per source.
14 Implementation Example (Azure) #
[Azure Function Extract]
↓
[Blob Stage]
↓
[Data Factory Mapping Flow]
↓
[Azure Function Graph Loader]
↓
[Cosmos DB Gremlin API]
↓
[Application Insights + ServiceNow Alerts]
All orchestrated under one Resource Group; managed identity flows end-to-end.
15 Takeaway #
The ETL/ELT pipeline is the circulatory system of EA 2.0.
It must be governed like code, monitored like infrastructure, and trusted like audit data.
When built this way, your graph doesn’t age — it refreshes itself.