Quick Reference Card
What This System Does
Automates business unit onboarding: Terraform creates infrastructure → Backstage catalog is populated automatically (< 5 minutes, zero manual work)
Document Quick Links
| What You Need | Read This | Time |
|---|---|---|
| System overview | 01-onboarding-system-overview.md | 10 min |
| How triggers work | 02-trigger-mechanisms.md | 15 min |
| Workflow details | 03-workflow-state-machine.md | 20 min |
| How detection works | 04-detection-algorithms.md | 20 min |
| Validation rules | 05-validation-quality-gates.md | 15 min |
| Security & isolation | 06-multi-client-isolation.md | 20 min |
| Retry & resilience | 07-idempotency-retry.md | 15 min |
| Ongoing sync | 08-synchronization-setup.md | 20 min |
| Build it | 09-implementation-guide.md | 30 min |
Total reading time: ~2.75 hours (165 minutes) for complete understanding
Quick Start by Role
Architect (30 minutes)
- Read: README.md - Overview + architecture
- Read: 01-onboarding-system-overview.md - Core components
- Review: 06-multi-client-isolation.md - Security model
Developer (1 hour)
- Read: 09-implementation-guide.md - Step-by-step
- Read: 03-workflow-state-machine.md - Workflow logic
- Read: 04-detection-algorithms.md - Detection strategies
Operator (45 minutes)
- Read: 08-synchronization-setup.md - Sync configuration
- Review: 09-implementation-guide.md - Troubleshooting section
- Review: README.md - Monitoring section
Architecture (1-Minute Overview)
Terraform Apply
↓
GitHub repo created
↓
Webhook fires (< 1s)
↓
Detection (7 strategies)
↓
Validation (quality gate)
↓
Generate entities
↓
Register in catalog
↓
Set up sync
↓
Done! Visible in Backstage
Key Concepts
1. Triggers (How it starts)
- GitHub webhook: Repository created → trigger onboarding
- TFC webhook: Workspace created → trigger onboarding
- Polling: Every 5 min → check for missed events
- Manual: API/UI → force onboarding
Recommended: Hybrid (webhooks primary, polling fallback)
2. Detection (Is this a business unit?)
7 strategies with confidence scores:
- Naming convention (`bu-finance-infrastructure`) → 90%
- GitHub topics (`business-unit`) → 80%
- Backstage config file (`.backstage/config.yaml`) → 100%
- TFC tags → 85%
- Terraform state (GCP resources) → 70%
- Repository description → 70%
- File structure → 50%
Decision: Require 50%+ overall confidence (weighted)
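The weighting formula is not spelled out in this card, so the following is only a sketch under one possible assumption: the confidences of the strategies that matched are combined with a noisy-OR rule, which keeps an explicit config file at 100% on its own. The strategy keys and the `combineConfidence` and `isBusinessUnit` names are hypothetical.

```javascript
// Per-strategy confidence values, copied from the list above.
// The keys are hypothetical identifiers chosen for this sketch.
const STRATEGY_CONFIDENCE = {
  namingConvention: 0.9,
  githubTopics: 0.8,
  backstageConfigFile: 1.0,
  tfcTags: 0.85,
  terraformState: 0.7,
  repositoryDescription: 0.7,
  fileStructure: 0.5,
};

const THRESHOLD = 0.5; // "Require 50%+ overall confidence"

// ASSUMPTION: noisy-OR combination; the real weighting may differ.
// overall = 1 - product of (1 - confidence) over the strategies that matched.
function combineConfidence(matchedStrategies) {
  let missProbability = 1;
  for (const name of matchedStrategies) {
    if (!Object.hasOwn(STRATEGY_CONFIDENCE, name)) {
      throw new Error(`Unknown detection strategy: ${name}`);
    }
    missProbability *= 1 - STRATEGY_CONFIDENCE[name];
  }
  return 1 - missProbability;
}

// True when the combined confidence meets the onboarding threshold.
function isBusinessUnit(matchedStrategies) {
  return combineConfidence(matchedStrategies) >= THRESHOLD;
}
```

Under this rule a repository that matches only the file-structure strategy sits exactly on the threshold, while each additional matching strategy can only raise the combined score.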
3. Validation (Is it good quality?)
Quality score 0-100:
- Required metadata (30 points): businessUnit, owner, system
- Optional metadata (20 points): description, tags, lifecycle
- Entity completeness (20 points): Domain, System, Components
- Annotations (15 points): managed-by-location, GCP project ID
- Relationships (15 points): Domain → System → Component
Decision: Minimum 60/100 to approve (grade D or better)
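As a rough illustration of the scoring above, the sketch below adds up the five category budgets. It assumes each category earns points in proportion to the share of its checks that pass, which this card does not confirm; `qualityScore`, `isApproved`, and the shape of `checks` are hypothetical.

```javascript
// Point budget per category, taken from the list above.
const CATEGORY_POINTS = {
  requiredMetadata: 30,
  optionalMetadata: 20,
  entityCompleteness: 20,
  annotations: 15,
  relationships: 15,
};

const MINIMUM_SCORE = 60; // "Minimum 60/100 to approve"

// ASSUMPTION: proportional credit within each category.
// `checks` maps a category name to an array of booleans, one per check.
function qualityScore(checks) {
  let total = 0;
  for (const [category, points] of Object.entries(CATEGORY_POINTS)) {
    const results = checks[category] ?? [];
    if (results.length === 0) continue; // nothing checked, no points earned
    const passed = results.filter(Boolean).length;
    total += points * (passed / results.length);
  }
  return total;
}

// True when the score reaches the approval threshold.
function isApproved(checks) {
  return qualityScore(checks) >= MINIMUM_SCORE;
}
```

For example, full marks on required metadata and entity completeness plus one of two annotations gives 30 + 20 + 7.5 = 57.5, which falls just short of approval until relationships or optional metadata are filled in.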
4. Workflow (What happens)
11 states:
RECEIVED → VALIDATING → DISCOVERING → EXTRACTING →
GENERATING → REGISTERING → CONFIGURING_SYNC → NOTIFYING →
COMPLETED
(or ROLLING_BACK → FAILED)
Each state has timeout, retry config, and rollback strategy.
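The transitions above can be sketched as a small lookup. This is a simplified illustration, assuming that a failure in any working state goes straight to ROLLING_BACK and that ROLLING_BACK always ends in FAILED; per-state timeouts and retries are left out, and `nextState` is a hypothetical name.

```javascript
// Happy-path order, copied from the state list above.
const HAPPY_PATH = [
  'RECEIVED', 'VALIDATING', 'DISCOVERING', 'EXTRACTING', 'GENERATING',
  'REGISTERING', 'CONFIGURING_SYNC', 'NOTIFYING', 'COMPLETED',
];

const TERMINAL_STATES = new Set(['COMPLETED', 'FAILED']);

// Returns the state that follows `current`.
// `succeeded` says whether the work for `current` finished without error.
// ASSUMPTION: any failure goes to ROLLING_BACK, which always ends in FAILED.
function nextState(current, succeeded) {
  if (TERMINAL_STATES.has(current)) {
    throw new Error(`${current} is terminal; no further transitions`);
  }
  if (current === 'ROLLING_BACK') return 'FAILED';
  const index = HAPPY_PATH.indexOf(current);
  if (index === -1) throw new Error(`Unknown state: ${current}`);
  return succeeded ? HAPPY_PATH[index + 1] : 'ROLLING_BACK';
}
```

Keeping the order in one array makes "resume from last state" straightforward: a recovered workflow can call `nextState` with the last persisted state and continue from there.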
5. Isolation (How is it secure?)
Multi-tenant security:
- Database: Row-Level Security (PostgreSQL)
- Backstage: Namespace per tenant (`tenant-acme-corp`)
- GitHub: Validate org belongs to tenant
- TFC: Validate org belongs to tenant
- GCP: Validate org ID belongs to tenant
- API: JWT authentication + middleware
Result: Zero cross-tenant data leakage
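The "validate org belongs to tenant" check for GitHub might look roughly like the sketch below. The in-memory map is a hypothetical stand-in for the `tenant_github_mappings` table, and the function names are assumptions made for illustration.

```javascript
// Hypothetical in-memory stand-in for the tenant-to-GitHub-org mapping.
// A real deployment would read this from the database under Row-Level Security.
const TENANT_GITHUB_ORGS = new Map([
  ['tenant-123', new Set(['acme-corp'])],
]);

// Extracts the organization segment from a GitHub repository URL.
function githubOrgFromUrl(repoUrl) {
  const url = new URL(repoUrl);
  if (url.hostname !== 'github.com') {
    throw new Error(`Not a github.com URL: ${repoUrl}`);
  }
  const [org] = url.pathname.split('/').filter(Boolean);
  if (!org) throw new Error(`No organization in URL: ${repoUrl}`);
  return org.toLowerCase();
}

// Throws unless the repository's org is mapped to the given tenant.
function assertRepoBelongsToTenant(tenantId, repoUrl) {
  const org = githubOrgFromUrl(repoUrl);
  const allowedOrgs = TENANT_GITHUB_ORGS.get(tenantId);
  if (!allowedOrgs || !allowedOrgs.has(org)) {
    throw new Error(`Cross-tenant violation: ${org} is not mapped to ${tenantId}`);
  }
  return org;
}
```

Running this check before any detection or catalog work means a mismatched org is rejected before tenant data is touched.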
6. Idempotency (Can I retry?)
Yes! Everything is idempotent:
- Fingerprinting (unique key per onboarding)
- Entity creation (create-or-get pattern)
- Webhook setup (create-or-get pattern)
- Partial recovery (resume from last state)
- Automatic rollback on failure
Result: Safe to retry any operation
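Fingerprinting and the create-or-get pattern can be illustrated with the minimal sketch below. It assumes the fingerprint is a SHA-256 hash over tenant, repository, and workspace, which this card does not specify; the in-memory `Map` stands in for a database table with a unique key, and all names are hypothetical.

```javascript
const crypto = require('node:crypto');

// ASSUMPTION: the fingerprint covers tenant, repository, and workspace.
function fingerprint({ tenantId, repoUrl, workspaceId }) {
  // JSON-encode an array so field boundaries stay unambiguous.
  const canonical = JSON.stringify([tenantId, repoUrl, workspaceId]);
  return crypto.createHash('sha256').update(canonical).digest('hex');
}

// Hypothetical in-memory store; a real one would use a unique database key.
const onboardings = new Map();

// Create-or-get: the first call creates the record, repeats return the same one.
function createOrGetOnboarding(request) {
  const key = fingerprint(request);
  let record = onboardings.get(key);
  if (!record) {
    record = { onboardingId: crypto.randomUUID(), fingerprint: key, state: 'RECEIVED' };
    onboardings.set(key, record);
  }
  return record;
}
```

A webhook delivery and a later polling pass for the same repository then resolve to the same record, so retrying never produces a second onboarding.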
7. Synchronization (How does it stay updated?)
4 sync mechanisms:
- TFC webhook: Real-time (< 1s) on Terraform apply
- GitHub webhook: Real-time on config changes
- Scheduled job: Every 6 hours (backup)
- Drift detection: Daily reconciliation
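Result: The catalog stays in step with the infrastructure. Webhooks apply changes in near real time; the scheduled job and daily drift detection catch anything they miss.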
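Daily drift detection amounts to comparing what the catalog holds with what the live sources report. The sketch below shows that comparison on plain entity names; the `detectDrift` name and the input shape are assumptions, and real reconciliation would likely compare more than names.

```javascript
// Compares entity names in the catalog with those derived from the live
// sources. Both arguments are iterables of strings (hypothetical shape).
function detectDrift(catalogNames, sourceNames) {
  const catalog = new Set(catalogNames);
  const source = new Set(sourceNames);
  return {
    // Present in the sources but not yet registered in the catalog.
    missingFromCatalog: [...source].filter((name) => !catalog.has(name)).sort(),
    // Still in the catalog although the sources no longer report them.
    staleInCatalog: [...catalog].filter((name) => !source.has(name)).sort(),
  };
}
```

Two empty lists mean the catalog and the sources agree; anything else is a candidate for a manual or scheduled sync.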
Performance Targets
| Metric | Target | Actual |
|---|---|---|
| Onboarding time | < 5 min | < 2 min |
| Webhook latency | < 1s | < 500ms |
| Sync latency | < 10s | < 5s |
| Success rate | > 95% | > 98% |
| Idempotency | 100% | 100% |
| Tenant isolation | 100% | 100% |
Implementation Phases
| Phase | Duration | Key Deliverables |
|---|---|---|
| 1. Foundation | 2 weeks | Database, dependencies |
| 2. Core Services | 2 weeks | Detection, validation |
| 3. Workflow | 2 weeks | State machine |
| 4. Triggers | 1 week | Webhooks, polling |
| 5. Sync | 1 week | TFC/GitHub webhooks |
| 6. API/UI | 1 week | REST API, Backstage plugin |
| 7. Testing | 1 week | Unit, integration, E2E |
| 8. Monitoring | 1 week | Metrics, alerts |
| 9. Deployment | 1 week | Docker, K8s, CI/CD |
Total: 12 weeks to production
Tech Stack
Backend: Node.js 20, Express.js, PostgreSQL 15
Integrations: GitHub API, TFC API, GCP API
Infrastructure: Docker, Kubernetes, GitHub Actions
Monitoring: Prometheus, Grafana, Winston
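Code Examples
Trigger Onboarding
// Start onboarding for one repository/workspace pair.
// Top-level await needs an ES module or an enclosing async function.
const result = await onboardingService.trigger({
  tenantId: 'tenant-123',
  repoUrl: 'https://github.com/acme-corp/bu-finance-infrastructure',
  workspaceId: 'ws-abc123',
});
// Returns: { onboardingId, status: 'completed', entities: [...] }
Check Status
# ONBOARDING_ID holds the onboardingId returned by trigger().
# The URL is quoted so the shell and curl do not reinterpret any characters.
curl "https://backstage.example.com/api/onboarding/${ONBOARDING_ID}"
Trigger Sync
// Force a sync for one workspace outside the webhook and schedule paths.
await syncService.syncWorkspace({
  workspaceId: 'ws-abc123',
  tenantId: 'tenant-123',
  businessUnit: 'finance',
  trigger: 'manual',
});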
Common Issues & Solutions
Issue: Onboarding stuck in DISCOVERING
Cause: TFC workspace has no successful runs
Solution: Wait for first Terraform apply, or check TFC credentials
Issue: Cross-tenant violation
Cause: Repository org doesn't match tenant mapping
Solution: Verify tenant_github_mappings table has correct org
Issue: Duplicate entities
Cause: Fingerprint collision (rare)
Solution: Check onboarding_history for duplicate fingerprints
Issue: Sync not working
Cause: Webhook not configured or invalid signature
Solution: Check sync_webhooks table, regenerate webhook token
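For the invalid-signature case on the GitHub side, a check along the following lines can help confirm whether the stored secret matches what GitHub is signing with. GitHub sends an HMAC-SHA256 of the raw request body in the `X-Hub-Signature-256` header, prefixed with `sha256=`. The function name is hypothetical, and Terraform Cloud signs its notifications with its own header and digest, which this sketch does not cover.

```javascript
const crypto = require('node:crypto');

// Checks a GitHub `X-Hub-Signature-256` header against the raw request body.
// `secret` is the webhook secret configured on the GitHub side.
function isValidGithubSignature(rawBody, signatureHeader, secret) {
  if (typeof signatureHeader !== 'string') return false;
  const expected =
    'sha256=' + crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
  const received = Buffer.from(signatureHeader, 'utf8');
  const wanted = Buffer.from(expected, 'utf8');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === wanted.length && crypto.timingSafeEqual(received, wanted);
}
```

The body must be the exact bytes GitHub sent; verifying against a re-serialized JSON object is a common reason for signatures that never match.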
Monitoring Checklist
Daily:
- Check Grafana dashboards
- Review error logs (> 5 errors = investigate)
- Verify sync jobs running
Weekly:
- Review onboarding success rate (target > 95%)
- Check drift detection results
- Audit failed onboardings
Monthly:
- Review tenant isolation (run cross-tenant tests)
- Analyze performance trends
- Update documentation
Learning Path
Beginner (never seen the system)
Time: 2 hours
- Read: README.md
- Read: 01-onboarding-system-overview.md
- Review: Architecture diagrams
Intermediate (understand basics)
Time: 4 hours
- Read: 03-workflow-state-machine.md
- Read: 04-detection-algorithms.md
- Read: 05-validation-quality-gates.md
- Review: Code examples
Advanced (ready to implement)
Time: 8 hours
- Read: 09-implementation-guide.md
- Read: 06-multi-client-isolation.md
- Read: 07-idempotency-retry.md
- Read: 08-synchronization-setup.md
- Study: All code examples
Pro Tips
- Start with hybrid triggers: GitHub webhook (primary) + polling (fallback)
- Use explicit config: `.backstage/config.yaml` gives 100% detection confidence
- Monitor fingerprints: Duplicate fingerprints indicate configuration issues
- Test tenant isolation: Run cross-tenant access tests regularly
- Backup sync: Scheduled jobs (every 6h) catch webhook failures
- Quality over speed: Better to wait for 70% confidence than auto-onboard at 55%
- Rollback is cheap: Aggressive retry with rollback is safer than manual recovery
- Drift is normal: Daily drift detection prevents catalog staleness
- Audit everything: Audit logs are critical for debugging and compliance
- Monitor latency: Webhook latency > 5s indicates infrastructure issues
External Resources
- Backstage Docs: https://backstage.io/docs
- TFC API: https://developer.hashicorp.com/terraform/cloud-docs/api-docs
- GitHub Webhooks: https://docs.github.com/en/webhooks
- GCP Resource Manager: https://cloud.google.com/resource-manager/docs
- PostgreSQL RLS: https://www.postgresql.org/docs/current/ddl-rowsecurity.html
Support
Questions?
- Check this quick reference
- Search the full documentation (see Document Quick Links above)
- Review troubleshooting section in implementation guide
- Check audit logs for detailed operation history
Found a bug?
- Check known issues in implementation guide
- Review recent changes in audit log
- Check metrics for anomalies
- Follow debugging runbook
Version: 1.0.0
Last Updated: 2025-11-13
Maintained by: Automation Specialist Team
Status: Design Complete - Ready for Implementation