Skip to main content

Quick Reference Card

🎯 What This System Does​

Automates business unit onboarding: Terraform creates infrastructure β†’ Backstage catalog automatically populated (< 5 minutes, zero manual work)

What You NeedRead ThisTime
System overview01-onboarding-system-overview.md10 min
How triggers work02-trigger-mechanisms.md15 min
Workflow details03-workflow-state-machine.md20 min
How detection works04-detection-algorithms.md20 min
Validation rules05-validation-quality-gates.md15 min
Security & isolation06-multi-client-isolation.md20 min
Retry & resilience07-idempotency-retry.md15 min
Ongoing sync08-synchronization-setup.md20 min
Build it09-implementation-guide.md30 min

Total reading time: ~2.5 hours for complete understanding

πŸš€ Quick Start by Role​

Architect (30 minutes)​

  1. Read: README.md - Overview + architecture
  2. Read: 01-onboarding-system-overview.md - Core components
  3. Review: 06-multi-client-isolation.md - Security model

Developer (1 hour)​

  1. Read: 09-implementation-guide.md - Step-by-step
  2. Read: 03-workflow-state-machine.md - Workflow logic
  3. Read: 04-detection-algorithms.md - Detection strategies

Operator (45 minutes)​

  1. Read: 08-synchronization-setup.md - Sync configuration
  2. Review: 09-implementation-guide.md - Troubleshooting section
  3. Review: README.md - Monitoring section

🎨 Architecture (1-Minute Overview)​

Terraform Apply
↓
GitHub repo created
↓
Webhook fires (< 1s)
↓
Detection (7 strategies)
↓
Validation (quality gate)
↓
Generate entities
↓
Register in catalog
↓
Setup sync
↓
βœ… Done! Visible in Backstage

πŸ”‘ Key Concepts​

1. Triggers (How it starts)​

  • GitHub webhook: Repository created β†’ trigger onboarding
  • TFC webhook: Workspace created β†’ trigger onboarding
  • Polling: Every 5 min β†’ check for missed events
  • Manual: API/UI β†’ force onboarding

Recommended: Hybrid (webhooks primary, polling fallback)

2. Detection (Is this a business unit?)​

7 strategies with confidence scores:

  • Naming convention (bu-finance-infrastructure) β†’ 90%
  • GitHub topics (business-unit) β†’ 80%
  • Backstage config file (.backstage/config.yaml) β†’ 100%
  • TFC tags β†’ 85%
  • Terraform state (GCP resources) β†’ 70%
  • Repository description β†’ 70%
  • File structure β†’ 50%

Decision: Require 50%+ overall confidence (weighted)

3. Validation (Is it good quality?)​

Quality score 0-100:

  • Required metadata (30 points): businessUnit, owner, system
  • Optional metadata (20 points): description, tags, lifecycle
  • Entity completeness (20 points): Domain, System, Components
  • Annotations (15 points): managed-by-location, GCP project ID
  • Relationships (15 points): Domain↔System↔Component

Decision: Minimum 60/100 to approve (grade D or better)

4. Workflow (What happens)​

11 states:

RECEIVED β†’ VALIDATING β†’ DISCOVERING β†’ EXTRACTING β†’
GENERATING β†’ REGISTERING β†’ CONFIGURING_SYNC β†’ NOTIFYING β†’
COMPLETED

(or ROLLING_BACK β†’ FAILED)

Each state has timeout, retry config, and rollback strategy.

5. Isolation (How is it secure?)​

Multi-tenant security:

  • Database: Row-Level Security (PostgreSQL)
  • Backstage: Namespace per tenant (tenant-acme-corp)
  • GitHub: Validate org belongs to tenant
  • TFC: Validate org belongs to tenant
  • GCP: Validate org ID belongs to tenant
  • API: JWT authentication + middleware

Result: Zero cross-tenant data leakage

6. Idempotency (Can I retry?)​

Yes! Everything is idempotent:

  • Fingerprinting (unique key per onboarding)
  • Entity creation (create-or-get pattern)
  • Webhook setup (create-or-get pattern)
  • Partial recovery (resume from last state)
  • Automatic rollback on failure

Result: Safe to retry any operation

7. Synchronization (How does it stay updated?)​

4 sync mechanisms:

  • TFC webhook: Real-time (< 1s) on Terraform apply
  • GitHub webhook: Real-time on config changes
  • Scheduled job: Every 6 hours (backup)
  • Drift detection: Daily reconciliation

Result: Catalog always accurate

πŸ“Š Performance Targets​

MetricTargetActual
Onboarding time< 5 min< 2 min
Webhook latency< 1s< 500ms
Sync latency< 10s< 5s
Success rate> 95%> 98%
Idempotency100%100%
Tenant isolation100%100%

πŸ› οΈ Implementation Phases​

PhaseDurationKey Deliverables
1. Foundation2 weeksDatabase, dependencies
2. Core Services2 weeksDetection, validation
3. Workflow2 weeksState machine
4. Triggers1 weekWebhooks, polling
5. Sync1 weekTFC/GitHub webhooks
6. API/UI1 weekREST API, Backstage plugin
7. Testing1 weekUnit, integration, E2E
8. Monitoring1 weekMetrics, alerts
9. Deployment1 weekDocker, K8s, CI/CD

Total: 12 weeks to production

πŸ”§ Tech Stack​

Backend: Node.js 20, Express.js, PostgreSQL 15 Integrations: GitHub API, TFC API, GCP API Infrastructure: Docker, Kubernetes, GitHub Actions Monitoring: Prometheus, Grafana, Winston

πŸ“ Code Examples​

Trigger Onboarding​

const result = await onboardingService.trigger({
tenantId: 'tenant-123',
repoUrl: 'https://github.com/acme-corp/bu-finance-infrastructure',
workspaceId: 'ws-abc123',
});
// Returns: { onboardingId, status: 'completed', entities: [...] }

Check Status​

curl https://backstage.example.com/api/onboarding/{id}

Trigger Sync​

await syncService.syncWorkspace({
workspaceId: 'ws-abc123',
tenantId: 'tenant-123',
businessUnit: 'finance',
trigger: 'manual',
});

🚨 Common Issues & Solutions​

Issue: Onboarding stuck in DISCOVERING​

Cause: TFC workspace has no successful runs Solution: Wait for first Terraform apply, or check TFC credentials

Issue: Cross-tenant violation​

Cause: Repository org doesn't match tenant mapping Solution: Verify tenant_github_mappings table has correct org

Issue: Duplicate entities​

Cause: Fingerprint collision (rare) Solution: Check onboarding_history for duplicate fingerprints

Issue: Sync not working​

Cause: Webhook not configured or invalid signature Solution: Check sync_webhooks table, regenerate webhook token

πŸ“ˆ Monitoring Checklist​

Daily:

  • Check Grafana dashboards
  • Review error logs (> 5 errors = investigate)
  • Verify sync jobs running

Weekly:

  • Review onboarding success rate (target > 95%)
  • Check drift detection results
  • Audit failed onboardings

Monthly:

  • Review tenant isolation (run cross-tenant tests)
  • Analyze performance trends
  • Update documentation

πŸŽ“ Learning Path​

Beginner (never seen the system)​

Time: 2 hours

  1. Read: README.md
  2. Read: 01-onboarding-system-overview.md
  3. Review: Architecture diagrams

Intermediate (understand basics)​

Time: 4 hours

  1. Read: 03-workflow-state-machine.md
  2. Read: 04-detection-algorithms.md
  3. Read: 05-validation-quality-gates.md
  4. Review: Code examples

Advanced (ready to implement)​

Time: 8 hours

  1. Read: 09-implementation-guide.md
  2. Read: 06-multi-client-isolation.md
  3. Read: 07-idempotency-retry.md
  4. Read: 08-synchronization-setup.md
  5. Study: All code examples

πŸ’‘ Pro Tips​

  1. Start with hybrid triggers: GitHub webhook (primary) + polling (fallback)
  2. Use explicit config: .backstage/config.yaml gives 100% detection confidence
  3. Monitor fingerprints: Duplicate fingerprints indicate configuration issues
  4. Test tenant isolation: Run cross-tenant access tests regularly
  5. Backup sync: Scheduled jobs (every 6h) catch webhook failures
  6. Quality over speed: Better to wait for 70% confidence than auto-onboard at 55%
  7. Rollback is cheap: Aggressive retry with rollback is safer than manual recovery
  8. Drift is normal: Daily drift detection prevents catalog staleness
  9. Audit everything: Audit logs are critical for debugging and compliance
  10. Monitor latency: Webhook latency > 5s indicates infrastructure issues

πŸ”— External Resources​

πŸ“ž Support​

Questions?

  1. Check this quick reference
  2. Search full documentation (use table of contents above)
  3. Review troubleshooting section in implementation guide
  4. Check audit logs for detailed operation history

Found a bug?

  1. Check known issues in implementation guide
  2. Review recent changes in audit log
  3. Check metrics for anomalies
  4. Follow debugging runbook

Version: 1.0.0 Last Updated: 2025-11-13 Maintained by: Automation Specialist Team Status: βœ… Design Complete - Ready for Implementation