Automated Business Unit Onboarding System - Complete Design
Overview
This directory contains the complete design for an automated business unit onboarding system that integrates Terraform Cloud, GitHub, GCP, and Backstage.
What Gets Automated
When Terraform creates new infrastructure:
- Detects new business unit repository and workspace
- Validates metadata and resources
- Generates Backstage catalog entities (Domain, System, Components)
- Registers entities in catalog
- Configures ongoing synchronization
Result: Zero manual intervention from Terraform apply to Backstage visibility.
Key Features
🚀 Fully Automated
- Real-time detection via webhooks (< 1 second latency)
- Automatic entity generation
- No manual catalog updates
🔒 Multi-Tenant Secure
- Complete tenant isolation (database, namespace, API)
- Zero cross-tenant data leakage
- Row-level security policies
💪 Resilient
- Idempotent operations (safe to retry)
- Automatic rollback on failure
- Exponential backoff retry
- Partial onboarding recovery
🔄 Always In Sync
- TFC webhook for state changes
- GitHub webhook for config changes
- Scheduled backup sync (every 6 hours)
- Drift detection and auto-reconciliation
📊 Observable
- Comprehensive metrics (Prometheus)
- Detailed audit logs
- Real-time alerts (Slack)
- Status dashboards
Document Structure
1. System Overview
- Architecture diagram
- Core components
- Design principles
- Success criteria
2. Trigger Mechanisms
- GitHub webhooks (real-time)
- Terraform Cloud webhooks
- Polling discovery (fallback)
- Manual trigger API
- Recommended: Hybrid approach
3. Workflow State Machine
- Complete state machine (11 states)
- State transitions and timeouts
- Retry configuration per state
- Rollback mechanisms
- Implementation code
4. Detection Algorithms
- 7 detection strategies with confidence scoring
- Composite detection (weighted scoring)
- Project detection (under business units)
- Decision thresholds (50%, 70%, 90%)
- Example detection scenarios
5. Validation & Quality Gates
- Pre-validation (fast fail)
- Deep validation (metadata, state, GCP resources)
- Quality scoring (0-100, grades A-F)
- Quality gate decision logic
- Error handling
6. Multi-Client Isolation
- Tenant identification
- Database isolation (RLS)
- Namespace isolation
- External system isolation (GitHub, TFC, GCP)
- API isolation
- Cross-tenant prevention
7. Idempotency & Retry
- Fingerprinting (idempotent keys)
- Exponential backoff
- Per-state retry configuration
- Partial recovery
- Transactional rollback
- Compensating transactions
8. Synchronization Setup
- TFC webhook configuration
- Scheduled sync jobs
- GitHub repository watchers
- Drift detection
- Sync monitoring
- Configuration management
9. Implementation Guide
- 12-week phased implementation
- Database schema
- Step-by-step instructions
- Testing strategy
- Deployment (Docker, Kubernetes)
- Troubleshooting
Quick Start
For Architects
- Start with 01-onboarding-system-overview.md
- Review architecture and components
- Read 02-trigger-mechanisms.md for event flow
- Understand 06-multi-client-isolation.md for security
For Developers
- Read 09-implementation-guide.md for step-by-step
- Review 03-workflow-state-machine.md for workflow
- Study 04-detection-algorithms.md for detection logic
- Implement following the phase-by-phase guide
For Operators
- Start with 08-synchronization-setup.md
- Understand monitoring and metrics
- Review troubleshooting section in 09-implementation-guide.md
- Set up alerts and dashboards
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────┐
│ TRIGGER LAYER │
├─────────────────────────────────────────────────────────────────────┤
│ GitHub Webhook │ TFC Webhook │ Polling (5min) │ Manual API │
└────────┬───────────────┬────────────────┬──────────────────┬─────────┘
│ │ │ │
└───────────────┴────────────────┴──────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ EVENT ROUTER & ORCHESTRATOR │
│ • Deduplication (fingerprinting) │
│ • Tenant isolation validation │
│ • Rate limiting │
└────────────────────────────────┬────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ ONBOARDING WORKFLOW ENGINE │
│ │
│ RECEIVED → VALIDATING → DISCOVERING → EXTRACTING → │
│ GENERATING → REGISTERING → CONFIGURING_SYNC → NOTIFYING → │
│ COMPLETED │
│ │
│ Rollback: ROLLING_BACK → FAILED │
└────────────────────────────────┬────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ INTEGRATION LAYER │
├──────────────┬──────────────┬──────────────┬────────────────────────┤
│ GitHub API │ TFC API │ GCP API │ Backstage Catalog DB │
└──────────────┴──────────────┴──────────────┴────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ SYNCHRONIZATION LAYER │
│ • TFC Webhooks (real-time state changes) │
│ • Scheduled Jobs (every 6 hours) │
│ • Drift Detection (daily) │
│ • GitHub Watchers (config changes) │
└─────────────────────────────────────────────────────────────────────┘
Data Flow
Onboarding Flow
Terraform Apply (creates GCP folder + projects)
↓
GitHub repo created
↓
GitHub webhook fires (< 1s latency)
↓
Event Router (deduplicate, validate tenant)
↓
Detection (7 strategies, weighted scoring)
↓
Validation (pre-validation + deep validation)
↓
Quality Gate (score 0-100, minimum 60)
↓
Entity Generation (Domain, System, Components)
↓
Catalog Registration (transactional insert)
↓
Sync Setup (TFC webhook + scheduled jobs)
↓
Notification (Slack, email)
↓
✅ Completed (visible in Backstage)
Synchronization Flow
Terraform Apply (infrastructure change)
↓
TFC webhook fires
↓
Fetch current state
↓
Parse resources (GCP projects, folders, services)
↓
Calculate diff (added, modified, deleted entities)
↓
Apply changes (transactional update)
↓
✅ Catalog synchronized
Key Design Decisions
1. Event-Driven Architecture
Decision: Use webhooks as primary trigger mechanism Rationale: Real-time (< 1s latency), efficient (no polling), reliable (automatic retry) Fallback: Polling every 5 minutes catches missed webhooks
2. State Machine Workflow
Decision: Implement workflow as finite state machine Rationale: Clear transitions, easy debugging, resume from failure, rollback support
3. Multi-Strategy Detection
Decision: Use 7 detection strategies with weighted scoring Rationale: Robust (no single point of failure), flexible (multiple naming conventions), confident (consensus-based)
4. Quality Gates
Decision: Require minimum 60% quality score Rationale: Balance automation vs quality, allow override for edge cases
5. Tenant Isolation
Decision: Database RLS + namespace + API middleware Rationale: Defense in depth, regulatory compliance, zero trust
6. Idempotent Operations
Decision: Fingerprint-based deduplication + create-or-get pattern Rationale: Safe retry, no duplicates, distributed systems resilience
Success Criteria
Performance
- ✅ Onboarding time: < 5 minutes (target: < 2 minutes)
- ✅ Webhook latency: < 1 second
- ✅ Sync latency: < 10 seconds
- ✅ Scheduled sync: Every 6 hours
Reliability
- ✅ Success rate: > 95%
- ✅ Idempotency: 100% (safe to retry)
- ✅ Rollback: Automatic on failure
- ✅ Recovery: Resume from last successful state
Security
- ✅ Tenant isolation: 100% (zero cross-tenant leakage)
- ✅ Cross-tenant prevention: Validated at every layer
- ✅ Audit trail: All operations logged
- ✅ Secrets management: Never in state or logs
Quality
- ✅ Catalog accuracy: 100% (no stale entities)
- ✅ Drift detection: Daily reconciliation
- ✅ Entity quality: Minimum score 60/100
- ✅ Test coverage: > 90%
Implementation Timeline
| Phase | Duration | Deliverables |
|---|---|---|
| Phase 1: Foundation | Week 1-2 | Database schema, dependencies |
| Phase 2: Core Services | Week 3-4 | Detection, validation, tenant resolution |
| Phase 3: Workflow Engine | Week 5-6 | State machine, handlers, persistence |
| Phase 4: Triggers | Week 7 | GitHub/TFC webhooks, polling, manual API |
| Phase 5: Synchronization | Week 8 | Webhooks, scheduled jobs, drift detection |
| Phase 6: API & UI | Week 9 | REST API, Backstage plugin |
| Phase 7: Testing | Week 10 | Unit, integration, E2E tests |
| Phase 8: Monitoring | Week 11 | Metrics, logging, alerting |
| Phase 9: Deployment | Week 12 | Docker, Kubernetes, CI/CD |
Total: 12 weeks to production-ready system
Technology Stack
Backend
- Runtime: Node.js 20
- Framework: Express.js
- Database: PostgreSQL 15 (with RLS)
- ORM: Knex.js
- Validation: Zod
Integrations
- GitHub: @octokit/rest
- Terraform Cloud: @hashicorp/js-releases
- GCP: @google-cloud/resource-manager
- Backstage: @backstage/catalog-model
Infrastructure
- Container: Docker
- Orchestration: Kubernetes
- CI/CD: GitHub Actions
- Monitoring: Prometheus + Grafana
- Logging: Winston + Loki
Configuration
Environment Variables
DATABASE_URL=postgresql://user:password@localhost:5432/backstage
BACKSTAGE_URL=https://backstage.example.com
GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxxx
TFC_TOKEN=xxxxxxxxxxxxxxxxxxxxx
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
JWT_SECRET=your-jwt-secret
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx
Database Schema
- 15 tables
- Row-Level Security enabled
- Tenant isolation enforced
- Audit logging
- Full-text search indexes
API Endpoints
POST /api/onboarding/trigger
GET /api/onboarding/:id
POST /api/webhooks/github
POST /api/webhooks/tfc
GET /api/metrics
GET /api/health
Metrics & Monitoring
Key Metrics
onboarding_total- Counter (by tenant, status)onboarding_duration_seconds- Histogram (by tenant, state)entities_created_total- Gauge (by tenant, kind)sync_success_rate- Gauge (by workspace)sync_duration_ms- Histogramdrift_detected- Counter
Alerts
- Onboarding failure rate > 5%
- Sync failure > 3 times in 24h
- Drift detected (critical resources)
- Cross-tenant access attempt
- Rate limit exceeded
Dashboards
- Onboarding Overview (success rate, duration, volume)
- Sync Status (jobs, latency, errors)
- Tenant Metrics (per-tenant usage)
- Entity Health (catalog completeness)
Security Considerations
Threat Model
- Cross-tenant data leakage → RLS + namespace + validation
- Webhook spoofing → Signature verification
- API abuse → Rate limiting + authentication
- Secrets in state → Sanitization + detection
- Unauthorized onboarding → Tenant validation
Mitigations
- ✅ Row-Level Security (PostgreSQL)
- ✅ Namespace isolation (Backstage)
- ✅ JWT authentication
- ✅ Webhook signature verification
- ✅ Rate limiting (per tenant)
- ✅ Audit logging (all operations)
- ✅ Secrets detection (Terraform state)
Testing Strategy
Unit Tests (> 90% coverage)
- Detection algorithms
- Validation logic
- State machine transitions
- Tenant resolution
Integration Tests
- Full onboarding workflow
- Sync operations
- API endpoints
- Database operations
E2E Tests
- GitHub webhook → onboarding
- TFC webhook → sync
- Manual trigger → completion
- Drift detection → reconciliation
Load Tests
- 100 concurrent onboardings
- 1000 entities created
- Sync 50 workspaces simultaneously
Support & Maintenance
Runbooks
- Onboarding stuck troubleshooting
- Sync failure recovery
- Drift reconciliation
- Cross-tenant violation investigation
Monitoring Checklist
- Check Grafana dashboards daily
- Review error logs weekly
- Audit drift detection monthly
- Validate tenant isolation quarterly
Backup & Disaster Recovery
- Database backups (daily, retained 30 days)
- State snapshots (per onboarding)
- Audit log retention (1 year)
- Rollback procedures documented
Future Enhancements
Phase 2 (Future)
- Multi-region support
- Advanced RBAC (per-entity permissions)
- Automated entity relationship inference
- Machine learning for detection (improve accuracy)
- Cost tracking (GCP billing integration)
- Compliance reporting (SOC2, GDPR)
Nice to Have
- GraphQL API (in addition to REST)
- Real-time WebSocket updates (UI)
- Entity diff visualization
- Onboarding analytics dashboard
- Chatops integration (Slack bot)
Contributing
Code Review Checklist
- Unit tests added/updated
- Integration tests pass
- Documentation updated
- Security reviewed (no secrets, SQL injection, XSS)
- Performance tested (no N+1 queries)
- Tenant isolation validated
Development Workflow
- Create feature branch
- Implement with tests
- Run full test suite
- Submit PR with description
- Address code review feedback
- Merge to main
- Deploy to staging
- Validate in staging
- Deploy to production
- Monitor metrics
Resources
Documentation
- Backstage Catalog: https://backstage.io/docs/features/software-catalog/
- Terraform Cloud API: https://developer.hashicorp.com/terraform/cloud-docs/api-docs
- GitHub Webhooks: https://docs.github.com/en/webhooks
- GCP Resource Manager: https://cloud.google.com/resource-manager/docs
Related Projects
- Backstage Backend System: https://backstage.io/docs/backend-system/
- Terraform Provider (Backstage): https://registry.terraform.io/providers/backstage/backstage/latest/docs
Contact
For questions or support:
- Architecture: Review this documentation
- Implementation: See 09-implementation-guide.md
- Troubleshooting: Check runbooks and logs
- Security Issues: Follow responsible disclosure process
Last Updated: 2025-11-13 Version: 1.0.0 Status: Design Complete, Ready for Implementation