Common Integration Errors • Schema Drift • Performance • Security Exceptions
1 Purpose
Even intelligent architectures occasionally hiccup. This guide captures the most frequent symptoms you’ll see in production EA 2.0 deployments and provides quick, deterministic fixes—no guesswork, no panic.
2 Integration & Connectivity Issues

| Symptom | Likely Cause | Resolution |
| --- | --- | --- |
| Feed not loading from CMDB / Cloud API | Token expired / wrong scope | Re-authorize via Managed Identity or refresh OAuth token. |
| ETL fails mid-pipeline | Schema version mismatch | Compare source JSON to Data Contract; update mapping in Transform stage. |
| “403 Forbidden” from ServiceNow API | IP not whitelisted or wrong user role | Add Function App IP to ServiceNow allow list; assign sn_api_integration role. |
| Duplicate records in Graph | No unique source_id field | Add UUID hash per row in Normalize function. |
| Missing data for new applications | Source feed not incremental | Enable “delta sync” and set last_updated_at filter. |
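The last two resolutions (a UUID hash per row and a last_updated_at delta filter) can be sketched as follows. This is a minimal illustration, not the actual Normalize function: the namespace string, the `key_fields` parameter, and the choice to hash a business key rather than the whole row are all assumptions made for the example.

```python
import uuid
from datetime import datetime

# Hypothetical fixed namespace; it only needs to stay constant across runs.
EA2_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "ea2.example.internal")


def make_source_id(source_system, row, key_fields):
    """Derive a deterministic UUID from the source system and the row's business key."""
    key = "|".join(str(row[field]) for field in key_fields)
    return str(uuid.uuid5(EA2_NAMESPACE, f"{source_system}:{key}"))


def normalize(source_system, rows, key_fields):
    """Attach source_id to each row; keep only the first row seen per source_id."""
    seen = set()
    unique = []
    for row in rows:
        source_id = make_source_id(source_system, row, key_fields)
        if source_id in seen:
            continue  # duplicate of a row already emitted in this batch
        seen.add(source_id)
        unique.append({**row, "source_id": source_id})
    return unique


def delta_filter(rows, watermark):
    """Keep rows whose last_updated_at is strictly newer than the previous sync time."""
    return [
        row for row in rows
        if datetime.fromisoformat(row["last_updated_at"]) > watermark
    ]
```

Because the ID is derived from the business key alone, re-ingesting an updated record yields the same source_id, so the loader can upsert instead of creating a second node. The watermark passed to `delta_filter` should be timezone-aware if the feed's timestamps carry an offset.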
3 Schema Drift & Data Quality

| Symptom | Root Cause | Fix Pattern |
| --- | --- | --- |
| “Key not found” error in loader | New field introduced in source schema | Update the Ontology and Mapping file (vNext). |
| Graph edges missing | Relationship type renamed / removed | Run Schema Validator job → re-infer relationships. |
| Data stale for more than 7 days | Cron job failed or Function paused | Check Timer Trigger logs and restart. |
| DQ score drops below 0.7 | Feed contains nulls / invalid IDs | Re-apply DQ rules; trigger Steward task. |
| Policy evaluation wrong | Policy JSON uses old taxonomy | Sync taxonomy table from master repo. |
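Two of the checks above lend themselves to a short sketch: comparing an incoming record against the fields a data contract expects, and computing a DQ score to compare with the 0.7 threshold. The scoring rule here (share of rows with every required field populated) is an assumption for illustration; the source does not define how EA 2.0 calculates its DQ score, and the function names are hypothetical.

```python
from __future__ import annotations

DQ_THRESHOLD = 0.7  # threshold quoted in the table above


def detect_drift(record: dict, contract_fields: set[str]) -> dict:
    """Compare one source record's keys with the fields the contract expects."""
    actual = set(record)
    return {
        "new_fields": sorted(actual - contract_fields),
        "missing_fields": sorted(contract_fields - actual),
    }


def dq_score(rows: list[dict], required: tuple[str, ...]) -> float:
    """Fraction of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    valid = sum(
        1 for row in rows
        if all(row.get(field) not in (None, "") for field in required)
    )
    return valid / len(rows)


def needs_steward_task(rows: list[dict], required: tuple[str, ...]) -> bool:
    """True when the batch falls below the DQ threshold."""
    return dq_score(rows, required) < DQ_THRESHOLD
```

A non-empty `new_fields` list is the early warning for the “Key not found” loader error, since it shows up before the mapping file is exercised.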
4 Performance and Cost

| Symptom | Root Cause | Mitigation Action |
| --- | --- | --- |
| Slow NLQ responses | Graph query unindexed | Add index on name and type fields. |
| High Cosmos RU/s consumption | Over-fetching entire graph | Use pagination & LIMIT 200 in Cypher. |
| ADF pipeline timeout | Long transform logic | Split into two pipelines (Extract + Transform). |
| Power BI refresh too slow | Dataset too large / complex joins | Use DirectQuery + aggregate views. |
| Storage cost spike | Logs not archived | Apply Lifecycle Policy → move files older than 90 days to Cool tier. |
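The pagination advice can be sketched as a small generator that pulls fixed-size pages instead of the whole graph. The `fetch_page` callable is a stand-in for whatever graph client is in use, and the `Application` label in the example query is hypothetical.

```python
from typing import Callable, Iterator

PAGE_SIZE = 200  # matches the LIMIT 200 suggested in the table

# Example of a paged query; offset paging needs a stable ORDER BY.
PAGED_QUERY = (
    "MATCH (n:Application) RETURN n "
    "ORDER BY n.name SKIP $skip LIMIT $limit"
)


def iter_pages(
    fetch_page: Callable[[int, int], list],
    page_size: int = PAGE_SIZE,
) -> Iterator[list]:
    """Yield successive pages until a short or empty page signals the end."""
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        if not page:
            return
        yield page
        if len(page) < page_size:
            return  # last page was partial, nothing left to fetch
        skip += page_size
```

A caller supplies `fetch_page(skip, limit)` that runs `PAGED_QUERY` with those parameters and returns the rows; consumers can then stop iterating early, which is where the RU/s savings come from.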
5 Predictive Layer & AI Errors

| Symptom | Possible Reason | Solution |
| --- | --- | --- |
| Model accuracy drop | Data drift / unbalanced training set | Retrain with latest quarter data; check feature weights. |
| RAG answers irrelevant | Embeddings outdated / vector index stale | Re-embed content via scheduled job. |
| Prompt timed out | Token limit or LLM latency | Reduce context size / use async FastAPI. |
| Guardrail blocked response | Sensitive term policy triggered | Review Prompt Library for safe phrasing. |
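For the “Prompt timed out” row, the two suggested fixes (smaller context, asynchronous calls) might look like the sketch below. The whitespace word count is a crude stand-in for a real tokenizer, and `llm_call` is a hypothetical coroutine representing whichever model client is used; neither comes from the source.

```python
import asyncio


def trim_context(chunks, max_tokens):
    """Keep chunks in ranked order until the approximate token budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())  # rough estimate, not a real tokenizer
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


async def call_with_timeout(llm_call, prompt, timeout_s):
    """Await the model call, returning None instead of hanging past timeout_s."""
    try:
        return await asyncio.wait_for(llm_call(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None
```

Inside an async FastAPI route the handler would simply `await call_with_timeout(...)` and map a `None` result to a retry or a user-facing timeout message.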
6 Governance & Security Exceptions

| Symptom | Detection Source | Action Required |
| --- | --- | --- |
| Unauthorized graph write | Audit Ledger alert | Revoke token; review RBAC logs. |
| Policy auto-remediation failed | Azure Policy error code | Retry Logic App with service principal rights. |
| Evidence missing for control | GRC sync job failed | Manually upload evidence → rerun API call. |
| Audit log tampering attempt | WORM write violation detected | Lock container and notify Compliance Officer. |
7 User & Access Problems

| Symptom | Root Cause | Remedy |
| --- | --- | --- |
| User cannot log in to NLQ UI | Entra ID token expired | Force reauth / refresh token policy. |
| Dashboard blank for some users | Missing Power BI dataset permission | Add to workspace security group. |
| “Access Denied” in graph API | Role = Viewer (need Analyst) | Promote RBAC role via Portal. |
8 Audit Trail Verification
Checklist:
☑ Audit Ledger hash verified weekly.
☑ Evidence files archived to immutable storage.
☑ Policy change log signed and timestamped.
☑ Sentinel integration report generated monthly.
If any box fails, escalate to Compliance Lead within 24 h.
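The weekly ledger hash check could be automated along these lines. The source does not describe the Audit Ledger's structure, so this sketch assumes a simple SHA-256 hash chain in which each entry stores the hash of its predecessor; the field names and the genesis value are hypothetical.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed placeholder hash preceding the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's canonical payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append_entry(ledger: list, payload: dict) -> None:
    """Append a payload, chaining it to the current tail of the ledger."""
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append(
        {"payload": payload, "prev_hash": prev, "hash": entry_hash(prev, payload)}
    )


def verify_ledger(ledger: list) -> Optional[int]:
    """Return the index of the first broken entry, or None if the chain is intact."""
    prev = GENESIS
    for index, entry in enumerate(ledger):
        expected = entry_hash(prev, entry["payload"])
        if entry["prev_hash"] != prev or entry["hash"] != expected:
            return index
        prev = entry["hash"]
    return None
```

A non-None result from `verify_ledger` corresponds to a failed checklist box and would trigger the 24 h escalation described above.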
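9 Diagnostics Commands (CLI Examples)
# Check graph ingestion jobs (streams live Function App logs)
az webapp log tail --name EA2IngestFn --resource-group EA2GovRG
# Validate Cosmos DB throughput (--database-name is required; substitute your database)
az cosmosdb sql container throughput show --account-name EA2Graph --database-name <database-name> --name nodes --resource-group EA2GovRG
# Verify Power BI refresh history (PowerShell, MicrosoftPowerBIMgmt module; run Connect-PowerBIServiceAccount first)
$dataset = Get-PowerBIDataset -Name "EA2_Metrics"
Invoke-PowerBIRestMethod -Url "datasets/$($dataset.Id)/refreshes" -Method Get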
10 Escalation Matrix

| Severity | Example Incident | Owner | Response Time |
| --- | --- | --- | --- |
| Critical | Graph unavailable > 1 h / security breach | EA Ops Manager + CISO | 1 h (immediate bridge call) |
| High | Major integration failure | Integration Lead + Data Steward | 4 h |
| Medium | Dashboard delay / minor DQ error | Service Manager | 1 business day |
| Low | Cosmetic UI issue | Product Owner | Next release cycle |
11 Knowledge Refresh & Self-Healing
EA 2.0 continuously learns from its own tickets:
Closed incidents → fed back into Reasoning API for pattern recognition.
Repeated errors → policy review task auto-created.
MTTR trends → feed Predictive Cost & Risk models.
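The second and third feedback loops above can be sketched as simple aggregations over closed incidents. The incident fields (`error_code`, `opened_at`, `closed_at`) and the repeat threshold of 3 are assumptions for illustration; the source does not specify how EA 2.0 detects repeats or stores tickets.

```python
from collections import Counter
from statistics import mean

REPEAT_THRESHOLD = 3  # assumed cut-off for "repeated"; tune per environment


def repeated_errors(incidents, threshold=REPEAT_THRESHOLD):
    """Return error codes that occur at least `threshold` times, sorted by name."""
    counts = Counter(incident["error_code"] for incident in incidents)
    return sorted(code for code, n in counts.items() if n >= threshold)


def mttr_hours(incidents):
    """Mean time to resolve, in hours, over incidents that have been closed."""
    durations = [
        (i["closed_at"] - i["opened_at"]).total_seconds() / 3600
        for i in incidents
        if i.get("closed_at") is not None
    ]
    return mean(durations) if durations else 0.0
```

Each code returned by `repeated_errors` would map to one auto-created policy review task, and `mttr_hours` computed per week or month gives the trend series for the predictive models.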
12 Preventive Maintenance Checklist

| Frequency | Task | Responsible Role |
| --- | --- | --- |
| Daily | Check ingest logs & DQ scores | Data Steward |
| Weekly | Review policy breaches & closure rate | Policy Owner |
| Monthly | Validate backup restore + Power BI sync | Service Manager |
| Quarterly | Retrain predictive models & update ontology | EA Architect |
| Yearly | Full audit simulation & disaster recovery test | Compliance Officer |
13 Takeaway
Every architecture issue is just data that hasn’t been learned from yet. EA 2.0 turns troubleshooting into training — each resolution strengthens the system’s collective intelligence.