Architecture Decision Records (ADR) Summary
Multi-Tenant Backstage Terraform Cloud Plugin
Document Version: 1.0 Last Updated: 2024-11-13
Overview
This document summarizes all architectural decisions made for the multi-tenant Backstage plugin integrating with Terraform Cloud. Each ADR follows the format:
- Status: Accepted | Proposed | Deprecated | Superseded
- Context: Problem statement and constraints
- Decision: What was decided
- Rationale: Why this decision was made
- Consequences: Trade-offs and implications
ADR Index
| ID | Title | Status | Date | Impact |
|---|---|---|---|---|
| ADR-001 | Row-Level Security for Tenant Isolation | Accepted | 2024-11-13 | High |
| ADR-002 | Pub/Sub for Asynchronous Processing | Accepted | 2024-11-13 | High |
| ADR-003 | Real-Time Sync via Terraform Cloud Webhooks | Accepted | 2024-11-13 | Medium |
| ADR-004 | In-Memory State Sanitization | Accepted | 2024-11-13 | High |
| ADR-005 | PostgreSQL Partitioning by Tenant | Accepted | 2024-11-13 | Medium |
| ADR-006 | Redis for Catalog Caching | Accepted | 2024-11-13 | Medium |
| ADR-007 | Google Cloud Platform as Primary Provider | Accepted | 2024-11-13 | High |
| ADR-008 | Node.js 20 LTS for Backend Runtime | Accepted | 2024-11-13 | Low |
| ADR-009 | Kubernetes (GKE) for Container Orchestration | Accepted | 2024-11-13 | High |
| ADR-010 | Managed Prometheus for Observability | Accepted | 2024-11-13 | Low |
ADR-001: Row-Level Security for Tenant Isolation
Status: ✅ Accepted Date: 2024-11-13 Impact: High (Security-critical)
Context
Multi-tenant SaaS application requires strict data isolation between enterprise clients. Each tenant must only access their own data, with zero risk of cross-tenant data leaks. Regulatory compliance (SOC 2, GDPR) requires database-enforced isolation, not just application-layer checks.
Options Considered:
- Separate Databases per Tenant: Maximum isolation, but high operational cost (100+ databases)
- Separate Schemas per Tenant: Good isolation, but complex migrations
- Row-Level Security (RLS): Database-enforced isolation in single database
- Discriminator Column Only: Application-layer filtering (insufficient security)
Decision
Implement PostgreSQL Row-Level Security (RLS) with tenant_id discriminator column.
Rationale
Advantages:
- Security: Database enforces isolation via SQL policies (not application code)
- Cost: Single database instance reduces operational overhead by 95%
- Performance: Indexed tenant_id column + partitioning maintains query speed
- Compliance: Meets SOC 2 Type II data isolation requirements
- Simplicity: Single backup/restore process, no complex multi-DB orchestration
Implementation:
-- Enable RLS on catalog_entities table
ALTER TABLE catalog_entities ENABLE ROW LEVEL SECURITY;
-- Apply policies to the table owner as well (owners bypass RLS by default)
ALTER TABLE catalog_entities FORCE ROW LEVEL SECURITY;
-- Policy: Users can only see their tenant's data
CREATE POLICY tenant_isolation_policy ON catalog_entities
  USING (tenant_id = current_setting('app.current_tenant')::UUID);
Tenant Context Injection:
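// Middleware sets the PostgreSQL session variable inside the request's transaction.
// SET does not accept bind parameters, so use set_config(); the third argument
// (is_local = true) scopes the value to the current transaction, like SET LOCAL.
await db.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);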
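A transaction-scoped setting only takes effect inside a transaction, so the tenant context and the tenant's queries have to share one. The sketch below shows one way a helper could enforce that. `SqlClient` and `withTenant` are hypothetical names, and the client shape is assumed to resemble a node-postgres-style `query(text, params)` call; neither is confirmed by this document.

```typescript
// Hypothetical minimal client shape (modeled on a node-postgres-style API).
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<{ rows: unknown[] }>;
}

// Runs `work` inside a transaction whose RLS context is bound to one tenant.
async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  work: (client: SqlClient) => Promise<T>,
): Promise<T> {
  await client.query('BEGIN');
  try {
    // set_config(name, value, is_local = true) behaves like SET LOCAL:
    // the value is discarded when the transaction commits or rolls back.
    await client.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

Because the setting dies with the transaction, a pooled connection cannot carry one tenant's context into the next request.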
Consequences
Positive:
- Single database simplifies backups, migrations, and disaster recovery
- Lower infrastructure cost: $800/month vs. $80,000/month (100 databases)
- Automatic enforcement (no risk of application bugs bypassing checks)
Negative:
- PostgreSQL version dependency (requires 9.5+, already met)
- RLS policy complexity increases with authorization logic
- Slight performance overhead (~5-10ms per query)
Mitigation:
- Comprehensive RLS policy testing in CI/CD
- Automated leak detection (hourly job checks for cross-tenant entities)
- Regular security audits of RLS policies
Monitoring:
-- Detect cross-tenant leaks (alert if any results)
-- Run as a role that bypasses RLS; otherwise only one tenant's rows are visible
SELECT entity_ref, COUNT(DISTINCT tenant_id) as tenant_count
FROM catalog_entities
GROUP BY entity_ref
HAVING COUNT(DISTINCT tenant_id) > 1;
ADR-002: Pub/Sub for Asynchronous Processing
Status: ✅ Accepted Date: 2024-11-13 Impact: High (Scalability-critical)
Context
System must handle burst traffic from Terraform Cloud webhooks (1000+ workspace updates within minutes). Synchronous processing would block API requests and cause timeouts. Need decoupling between webhook ingestion and state processing.
Options Considered:
- Google Cloud Pub/Sub: Managed, serverless, auto-scaling
- RabbitMQ (self-hosted): Full control, complex operations
- Apache Kafka (self-hosted): High throughput, heavy operational overhead
- Database as Queue (SKIP LOCKED): Simple, but doesn't scale
Decision
Use Google Cloud Pub/Sub for all asynchronous task distribution.
Rationale
Advantages:
- Scalability: Managed service scales to millions of messages/sec automatically
- Reliability: At-least-once delivery guarantee with dead letter queues
- Integration: Native GCP integration (IAM auth, Cloud Monitoring)
- Cost: Pay-per-use ($40/million messages vs. $500/month for RabbitMQ VMs)
- Operations: Zero maintenance (no cluster management, no broker failures)
Message Flow:
Topics:
- terraform-workspace-discovered: New workspace found via GitHub scan
- terraform-state-updated: State version changed (webhook trigger)
- sanitization-failed: Dead letter queue for manual review
Consequences
Positive:
- Decoupled architecture: Webhook handlers respond in < 100ms
- Elastic capacity: Automatically handles burst traffic (1000 msg/sec)
- Retry logic: Exponential backoff (10s → 600s) for transient failures
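To make the 10s → 600s retry window concrete, here is a small sketch of a capped exponential schedule. The doubling factor and the name `retryDelaySeconds` are assumptions for illustration; the ADR only states the bounds, and the managed service applies its own retry policy on the server side.

```typescript
// Delay before retry number `attempt` (1-based): starts at minSeconds,
// doubles each attempt, and is capped at maxSeconds.
function retryDelaySeconds(attempt: number, minSeconds = 10, maxSeconds = 600): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError('attempt must be a positive integer');
  }
  return Math.min(maxSeconds, minSeconds * 2 ** (attempt - 1));
}

const schedule = [1, 2, 3, 4, 5, 6, 7, 8].map((n) => retryDelaySeconds(n));
// schedule: 10, 20, 40, 80, 160, 320, 600, 600
```

Under these assumptions the cap is reached on the seventh attempt, after roughly ten minutes of cumulative waiting.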
Negative:
- At-least-once delivery: Workers must be idempotent (duplicate messages possible)
- GCP vendor lock-in: Migrating to Kafka/RabbitMQ requires code changes
- Cost at extreme scale: $40/million messages (10M messages = $400/month)
Mitigation:
- Idempotent message handlers (check entity updated_at timestamp)
- Multi-cloud abstraction layer (future: support AWS SQS, Azure Service Bus)
- Cost monitoring: Alert if message volume exceeds 5M/day
Idempotent Handler Example:
async function handleStateUpdate(message: StateUpdateMessage) {
  const { workspace_id, state_version } = message;
  // Check if already processed (assumes a node-postgres-style result with `rows`)
  const existing = await db.query(
    'SELECT state_version FROM catalog_entities WHERE workspace_id = $1',
    [workspace_id]
  );
  if (existing.rows[0]?.state_version === state_version) {
    console.log('Duplicate message, skipping');
    return; // Already processed
  }
  // Process state update...
}
ADR-003: Real-Time Sync via Terraform Cloud Webhooks
Status: ✅ Accepted Date: 2024-11-13 Impact: Medium (User Experience)
Context
Users expect infrastructure changes to appear in the Backstage catalog within 5 minutes of terraform apply. Polling every workspace every 5 minutes (12 polls/hour per workspace) would require roughly 20,000 Terraform Cloud API calls/hour, hitting rate limits.
Options Considered:
- Webhooks (Event-Driven): Near-instant updates, low API usage
- Polling (Every 5 minutes): Simple, but high API cost and latency
- Hybrid (Webhooks + Hourly Poll): Best of both worlds
Decision
Use Terraform Cloud webhooks as primary sync mechanism, with hourly polling as fallback.
Rationale
Advantages:
- Latency: Webhook delivers event in < 30 seconds (vs. 2.5 minutes average for polling)
- Efficiency: Event-driven reduces API calls by 95% (20K → 1K calls/hour)
- Cost: Lower Terraform Cloud API usage (rate limit budget preserved)
- User Experience: "Live" catalog updates instead of stale data
Webhook Configuration:
# Per-tenant Terraform Cloud webhook
webhooks:
- name: "backstage-catalog-sync"
url: "https://plugin.backstage.example.com/api/webhooks/terraform-cloud"
token: "<hmac-secret>"
events:
- "run:completed"
- "run:errored"
- "workspace:created"
- "workspace:deleted"
Security:
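- HMAC signature verification (Terraform Cloud signs notification payloads with HMAC-SHA512 and sends the digest in the X-TFE-Notification-Signature header)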
- Timestamp validation (reject if > 5 minutes old)
- IP allowlist (Terraform Cloud egress IPs)
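The first two checks can be sketched with Node's built-in crypto module. `verifySignature` and `isFresh` are hypothetical helper names. The hash algorithm is passed in because it has to match whatever the sender signs with, and where the event timestamp comes from (header or payload field) is left as an assumption.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Returns true when `signatureHex` is the HMAC of `rawBody` under `secret`.
// `algorithm` must match the sender's choice (for example 'sha256' or 'sha512').
function verifySignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
  algorithm: string,
): boolean {
  const expected = createHmac(algorithm, secret).update(rawBody, 'utf8').digest();
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Rejects events older than maxAgeMs (5 minutes by default) or dated in the future.
function isFresh(eventTimeMs: number, nowMs: number, maxAgeMs = 5 * 60 * 1000): boolean {
  const age = nowMs - eventTimeMs;
  return age >= 0 && age <= maxAgeMs;
}
```

The signature must be computed over the raw request body before any JSON parsing, since re-serializing the payload can change its bytes.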
Consequences
Positive:
- Near-real-time catalog updates (< 5 minutes → < 30 seconds)
- 95% reduction in Terraform Cloud API calls
- Better user experience (perceived "live" updates)
Negative:
- Webhook reliability dependency: If Terraform Cloud webhook delivery fails, the catalog goes stale
- HMAC verification overhead: Every webhook request requires signature check
- Replay attack surface: Must track nonces to prevent replay
Mitigation:
- Fallback polling: Hourly full sync job catches missed webhooks
- Webhook retry logic: Terraform Cloud retries up to 3 times with exponential backoff
- Monitoring: Alert if no webhooks received for > 2 hours
Fallback Polling Job:
// Run every hour as safety net
setInterval(async () => {
  console.log('Starting hourly full sync (fallback)...');
  try {
    for (const tenant of tenants) {
      const workspaces = await terraformCloud.listWorkspaces(tenant.org);
      for (const workspace of workspaces) {
        const lastSyncTime = await getLastSyncTime(workspace.id);
        const hoursSinceSync = (Date.now() - lastSyncTime) / (1000 * 60 * 60);
        if (hoursSinceSync > 1) {
          await enqueueStateUpdate(workspace.id); // Missed webhook
        }
      }
    }
  } catch (err) {
    // Without this, a rejected promise in the timer callback goes unhandled
    console.error('Hourly full sync failed', err);
  }
}, 3600000); // 1 hour
ADR-004: In-Memory State Sanitization
Status: ✅ Accepted Date: 2024-11-13 Impact: High (Security-critical)
Context
Terraform state files contain sensitive data:
- API keys, passwords, database credentials
- Service account keys (GCP, AWS)
- Private IP addresses and internal hostnames
- Personally Identifiable Information (PII)
This data must never be persisted to disk or reach the Backstage catalog. Regulatory compliance (GDPR, CCPA, PCI-DSS) requires demonstrable sanitization.
Options Considered:
- In-Memory Sanitization: Process state in RAM, never write unredacted to disk
- Temporary File Sanitization: Write to disk, sanitize, then delete
- Database-Side Sanitization: Store raw state, sanitize on read (❌ unacceptable)
Decision
Sanitize Terraform state in-memory only, never persisting unredacted state to disk or database.
Rationale
Advantages:
- Security: No plaintext secrets on disk (reduces attack surface)
- Compliance: GDPR/CCPA "right to be forgotten" - no persistent PII
- Performance: In-memory processing 3-5x faster than disk I/O
- Auditability: All redactions logged before persistence
Sanitization Process:
// Worker receives state JSON from Terraform Cloud
const stateJson = await fetch(stateDownloadUrl).then(r => r.json());
// Load tenant-specific sanitization rules
const sanitizer = new StateSanitizer(tenantId);
// Sanitize in-memory (never write unredacted state to disk)
const { sanitized, violations } = await sanitizer.sanitize(stateJson);
// Log all redactions for audit
await auditLog.record({
workspace_id: workspace.id,
violations: violations, // [{ path: "resource.password", rule: "strip-passwords", category: "credential" }]
pii_count: violations.filter(v => v.category === 'pii').length,
credential_count: violations.filter(v => v.category === 'credential').length
});
// Persist ONLY sanitized state
await db.query('INSERT INTO catalog_entities ...', [sanitized]);
Sanitization Rules:
const defaultRules: SanitizationRule[] = [
{
id: 'strip-passwords',
pattern: /(password|passwd|pwd)\s*[:=]\s*["']?([^"'\s]+)/i,
action: 'redact',
category: 'credential'
},
{
id: 'strip-api-keys',
pattern: /[a-zA-Z0-9]{32,}/, // High-entropy strings
min_entropy: 4.5,
action: 'redact',
category: 'credential'
},
{
id: 'strip-emails',
pattern: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/,
action: 'redact',
category: 'pii'
}
];
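The rules above reference a `StateSanitizer` that is not shown. The sketch below is one possible reading of how such rules could be applied: walk the parsed state, test every string value against each rule, and gate a rule on Shannon entropy when it has `min_entropy`. All names here (`sanitize`, `shannonEntropy`, the `Rule` shape, the `[REDACTED]` marker) are assumptions for illustration. It redacts the whole string value and does not inspect key names.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface Rule {
  id: string;
  pattern: RegExp; // assumed to have no `g` flag, so exec() is stateless
  category: 'credential' | 'pii';
  min_entropy?: number; // bits per character; rule fires only at or above this
}

interface Violation {
  path: string;
  rule: string;
  category: Rule['category'];
}

const REDACTED = '[REDACTED]';

// Shannon entropy of a string in bits per character.
function shannonEntropy(s: string): number {
  const chars = Array.from(s);
  const counts = new Map<string, number>();
  chars.forEach((ch) => counts.set(ch, (counts.get(ch) ?? 0) + 1));
  let bits = 0;
  counts.forEach((n) => {
    const p = n / chars.length;
    bits -= p * Math.log2(p);
  });
  return bits;
}

function ruleMatches(rule: Rule, value: string): boolean {
  const match = rule.pattern.exec(value);
  if (!match) return false;
  return rule.min_entropy === undefined || shannonEntropy(match[0]) >= rule.min_entropy;
}

// Returns a redacted copy of `value` plus every violation found along the way.
function sanitize(
  value: Json,
  rules: Rule[],
  path = '$',
  violations: Violation[] = [],
): { sanitized: Json; violations: Violation[] } {
  if (typeof value === 'string') {
    const hit = rules.find((r) => ruleMatches(r, value));
    if (!hit) return { sanitized: value, violations };
    violations.push({ path, rule: hit.id, category: hit.category });
    return { sanitized: REDACTED, violations };
  }
  if (Array.isArray(value)) {
    const items = value.map((v, i) => sanitize(v, rules, `${path}[${i}]`, violations).sanitized);
    return { sanitized: items, violations };
  }
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value)) {
      out[key] = sanitize(value[key], rules, `${path}.${key}`, violations).sanitized;
    }
    return { sanitized: out, violations };
  }
  return { sanitized: value, violations }; // numbers, booleans, null pass through
}

const rules: Rule[] = [
  { id: 'strip-api-keys', pattern: /[a-zA-Z0-9]{32,}/, min_entropy: 4.5, category: 'credential' },
  { id: 'strip-emails', pattern: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/, category: 'pii' },
];
const result = sanitize({ owner: 'dev@example.com', region: 'us-central1' }, rules);
// result.sanitized: { owner: '[REDACTED]', region: 'us-central1' }
// result.violations: one 'pii' entry at path '$.owner'
```

The entropy gate is what keeps long but repetitive identifiers from being treated as keys: a 40-character run of a single letter matches the pattern but has zero entropy, so it is left alone.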
Consequences
Positive:
- Stronger security posture (no unredacted state persisted)
- Compliance-ready (GDPR Article 32: "pseudonymisation and encryption")
- Faster processing (no disk I/O for temporary files)
Negative:
- Higher memory requirements: Each worker needs 2-4GB RAM (state files 100-500MB)
- Complex sanitization logic: Regex rules require tuning to avoid false positives
- Loss of raw state: Cannot "undo" over-aggressive redactions
Mitigation:
- Worker pod memory limits: 4GB per pod
- Configurable allowlists: Tenants can mark specific patterns as "safe"
- Audit trail: Log all redactions for manual review
- Dead letter queue: States with excessive violations (> 100) go to manual review
Memory Management:
# Kubernetes resource limits
resources:
requests:
memory: 2Gi
cpu: 1000m
limits:
memory: 4Gi
cpu: 2000m
ADR-005: PostgreSQL Partitioning by Tenant
Status: ✅ Accepted Date: 2024-11-13 Impact: Medium (Performance Optimization)
Context
At 100 clients with 1000 entities each (100,000 total entities), full table scans become slow. Query latency exceeds 500ms for SELECT * FROM catalog_entities WHERE tenant_id = ? without partitioning.
Target Performance:
- < 100ms p95 query latency at 100K entities
- < 200ms p95 query latency at 1M entities
Decision
Implement hash partitioning by tenant_id with 10 partitions.
Rationale
Advantages:
- Query Performance: Partition pruning reduces scanned rows by 90%
- Index Efficiency: Smaller indexes per partition (10x faster lookups)
- Maintenance: Partition-level VACUUM and ANALYZE
- Scalability: Add partitions as tenant count grows
Partitioning Strategy:
-- Create partitioned table
CREATE TABLE catalog_entities (
entity_id UUID DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL,
entity_ref VARCHAR(500) NOT NULL,
metadata JSONB NOT NULL,
...
) PARTITION BY HASH (tenant_id);
-- Create 10 partitions (adjust based on tenant count)
CREATE TABLE catalog_entities_p0 PARTITION OF catalog_entities
FOR VALUES WITH (MODULUS 10, REMAINDER 0);
CREATE TABLE catalog_entities_p1 PARTITION OF catalog_entities
FOR VALUES WITH (MODULUS 10, REMAINDER 1);
-- ... create p2 through p9
-- Index on each partition
CREATE INDEX idx_entities_p0_ref ON catalog_entities_p0(entity_ref);
-- ... create indexes on p1 through p9
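Since every partition statement follows the same pattern, the DDL can be generated rather than written by hand. The sketch below assumes the `_p<remainder>` naming used above; `hashPartitionDdl` is a hypothetical helper, not part of any migration tool.

```typescript
// Builds one CREATE TABLE ... PARTITION OF statement per hash remainder.
function hashPartitionDdl(parent: string, modulus: number): string[] {
  // The table name is interpolated into SQL, so accept plain identifiers only.
  if (!/^[a-z_][a-z0-9_]*$/.test(parent)) {
    throw new Error('unexpected table name');
  }
  if (!Number.isInteger(modulus) || modulus < 1) {
    throw new RangeError('modulus must be a positive integer');
  }
  const statements: string[] = [];
  for (let remainder = 0; remainder < modulus; remainder++) {
    statements.push(
      `CREATE TABLE ${parent}_p${remainder} PARTITION OF ${parent} ` +
        `FOR VALUES WITH (MODULUS ${modulus}, REMAINDER ${remainder});`,
    );
  }
  return statements;
}

const ddl = hashPartitionDdl('catalog_entities', 10);
// ddl[0] and ddl[1] reproduce the two statements written out above
```

Generating the statements from a single modulus value also keeps the remainders complete, which matters because a missing remainder leaves some tenant_id values with no partition to land in.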
Query Plan (with partitioning):
EXPLAIN SELECT * FROM catalog_entities WHERE tenant_id = '123e4567-e89b-12d3-a456-426614174000';
-- Result: Partition pruning activated, only 1 partition scanned
Seq Scan on catalog_entities_p3 (rows=1000, actual=85ms)
Filter: (tenant_id = '123e4567-e89b-12d3-a456-426614174000')
Consequences
Positive:
- 10x query performance improvement (500ms → 50ms)
- Better cache utilization (smaller working set)
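- Contained impact of "hot" tenants: a large tenant only affects the partition its tenant_id hashes to (hash partitioning does not give large tenants a dedicated partition)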
Negative:
- Schema complexity (10 partition tables instead of 1)
- Migration overhead (partitioning existing data requires downtime)
- Cross-partition queries slower (rare in multi-tenant app)
Mitigation:
- Partition count configurable (start with 10, increase to 50 at 500+ tenants)
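- Staged repartitioning (PostgreSQL 11+): Raising the partition count of a hash-partitioned table changes the modulus and moves rows, so plan it as a staged migration (detach, split, reattach) rather than a simple add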
- Automated partition management: Add new partitions when tenant count grows
ADR-006: Redis for Catalog Caching
Status: ✅ Accepted Date: 2024-11-13 Impact: Medium (Performance Optimization)
Context
Backstage UI makes frequent catalog queries (10-50 requests/page load). Database queries for the same entity (e.g., "list all components") are repetitive and slow (100-200ms). Target: < 50ms p95 API latency.
Decision
Implement Redis caching with cache-aside pattern and 5-minute TTL.
Rationale
Advantages:
- Latency: Cache hits respond in 5-10ms (vs. 100-200ms database query)
- Database Load: Offload 70-80% of reads to cache
- Cost: Redis cheaper than additional database read replicas
Cache Strategy:
// Cache-aside pattern
async function getCatalogEntity(tenantId: string, entityRef: string): Promise<Entity | undefined> {
  const cacheKey = `entity:${tenantId}:${entityRef}`;
  // Try cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached); // Cache hit (5ms)
  }
  // Cache miss: query database (assumes a node-postgres-style result with `rows`)
  const result = await db.query(
    'SELECT * FROM catalog_entities WHERE tenant_id = $1 AND entity_ref = $2',
    [tenantId, entityRef]
  );
  const entity: Entity | undefined = result.rows[0];
  if (entity) {
    // Store in cache with TTL; missing entities are not cached
    await redis.setex(cacheKey, 300, JSON.stringify(entity)); // 5 minutes
  }
  return entity;
}
Cache Invalidation:
- Write-through: Update cache on every database write
- TTL-based: Expire after 5 minutes (eventual consistency acceptable)
- Manual invalidation: Webhook events trigger cache delete
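The write-through and manual-invalidation paths can be sketched as two small helpers that share the key format. `CacheClient`, `cacheEntity`, and `invalidateEntity` are hypothetical names; the `setex` and `del` method names mirror common Redis clients but are assumed here rather than taken from this document.

```typescript
// Hypothetical cache shape; only the two calls this sketch needs.
interface CacheClient {
  setex(key: string, ttlSeconds: number, value: string): Promise<unknown>;
  del(key: string): Promise<unknown>;
}

const ENTITY_TTL_SECONDS = 300; // 5 minutes, as in this ADR

// The tenant ID is part of the key so entries never collide across tenants.
function entityCacheKey(tenantId: string, entityRef: string): string {
  return `entity:${tenantId}:${entityRef}`;
}

// Write-through: call after the database write has succeeded.
async function cacheEntity(
  cache: CacheClient,
  tenantId: string,
  entityRef: string,
  entity: unknown,
): Promise<void> {
  await cache.setex(entityCacheKey(tenantId, entityRef), ENTITY_TTL_SECONDS, JSON.stringify(entity));
}

// Manual invalidation: call from the webhook handler; the next read
// misses the cache and repopulates it from the database.
async function invalidateEntity(
  cache: CacheClient,
  tenantId: string,
  entityRef: string,
): Promise<void> {
  await cache.del(entityCacheKey(tenantId, entityRef));
}
```

Deleting on webhook events rather than rewriting avoids caching a value the handler has not yet persisted; the TTL remains the backstop when an event is missed.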
Consequences
Positive:
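- A 70% cache hit rate brings mean read latency to roughly 50ms (0.7 × 5-10ms hits + 0.3 × 100-200ms misses); a < 50ms p95 needs a hit rate of about 95% or more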
- Database CPU usage reduced by 60%
- Better user experience (faster page loads)
Negative:
- Stale data: 5-minute TTL means catalog may be outdated
- Memory cost: Redis cluster ($200/month for 10GB)
- Complexity: Cache invalidation logic required
Mitigation:
- Webhook-driven invalidation for critical entities
- Monitoring: Alert if cache hit rate < 60%
- Redis Cluster for high availability
ADR-007: Google Cloud Platform as Primary Provider
Status: ✅ Accepted Date: 2024-11-13 Impact: High (Infrastructure)
Context
Need to select cloud provider for hosting multi-tenant Backstage plugin. Requirements:
- Managed Kubernetes (GKE, EKS, AKS)
- Managed PostgreSQL with high availability
- Managed message queue (Pub/Sub, SQS, Service Bus)
- Secret management with automatic rotation
Decision
Deploy on Google Cloud Platform (GCP) with multi-cloud abstraction layer.
Rationale
Advantages:
- GKE: Best-in-class Kubernetes (autopilot mode, auto-scaling)
- Cloud SQL: Managed PostgreSQL with RLS support, CMEK encryption
- Pub/Sub: Serverless message queue (vs. self-hosted RabbitMQ)
- Integration: Native IAM integration (no service account keys)
- Cost: 20% cheaper than AWS for equivalent workload
Cost Comparison (100 clients):
| Service | GCP | AWS | Difference |
|---|---|---|---|
| Kubernetes | $2,500 | $3,200 | -$700 |
| PostgreSQL HA | $800 | $1,100 | -$300 |
| Message Queue | $400 | $500 | -$100 |
| Total | $3,700 | $4,800 | -$1,100 |
Consequences
Positive:
- Lower operational cost (20% savings)
- Better Kubernetes experience (GKE autopilot)
- Managed services reduce maintenance burden
Negative:
- GCP vendor lock-in: Migrating to AWS requires re-architecture
- Learning curve: Team must learn GCP-specific services
Mitigation:
- Multi-cloud abstraction layer (support AWS in future)
- Infrastructure-as-code (Terraform) for portability
- Avoid GCP-specific features (use standard Kubernetes APIs)
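As an illustration of what the abstraction layer could look like for messaging, the sketch below defines a provider-neutral queue interface with an in-memory implementation. Every name in it (`MessageQueue`, `InMemoryQueue`, `QueueMessage`) is hypothetical and none comes from a real SDK; a Pub/Sub or SQS adapter would implement the same interface.

```typescript
interface QueueMessage {
  id: string;
  data: string;
}

type MessageHandler = (message: QueueMessage) => Promise<void>;

// Provider-neutral contract the application code depends on.
interface MessageQueue {
  publish(topic: string, data: string): Promise<string>; // resolves to the message id
  subscribe(topic: string, handler: MessageHandler): void;
}

// In-memory implementation for tests and local development.
class InMemoryQueue implements MessageQueue {
  private handlers = new Map<string, MessageHandler[]>();
  private nextId = 1;

  async publish(topic: string, data: string): Promise<string> {
    const message: QueueMessage = { id: String(this.nextId++), data };
    // Deliver to each subscriber in registration order.
    for (const handler of this.handlers.get(topic) ?? []) {
      await handler(message);
    }
    return message.id;
  }

  subscribe(topic: string, handler: MessageHandler): void {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler);
    this.handlers.set(topic, list);
  }
}
```

Keeping provider SDK calls behind one interface like this limits a later migration to writing a new adapter, though delivery semantics (ordering, retries, dead lettering) still differ between providers and would need their own treatment.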
ADR-008: Node.js 20 LTS for Backend Runtime
Status: ✅ Accepted Date: 2024-11-13 Impact: Low (Technology Choice)
Context
Backstage is built on Node.js/TypeScript. Need to select runtime version for backend plugin.
Decision
Use Node.js 20 LTS (Long-Term Support).
Rationale
Advantages:
- Compatibility: Backstage officially supports Node.js 18-20
- Performance: 10-20% faster than Node.js 16
- Security: LTS receives security patches until April 2026
- Modern Features: Native fetch API, test runner
Consequences
Positive:
- Faster API response times (native performance improvements)
- Better developer experience (modern JavaScript features)
- Long-term support (no forced upgrade until April 2026)
Negative:
- Must upgrade from Node.js 16 (some dependencies may break)
Mitigation:
- Test all dependencies with Node.js 20 before deployment
- Automated CI/CD tests on Node.js 20
ADR-009: Kubernetes (GKE) for Container Orchestration
Status: ✅ Accepted Date: 2024-11-13 Impact: High (Infrastructure)
Context
Need container orchestration for scaling backend pods (3-50 replicas), worker pods (5-100 replicas), and webhook handlers (2-20 replicas).
Decision
Deploy on Google Kubernetes Engine (GKE) with Autopilot mode.
Rationale
Advantages:
- Auto-Scaling: Horizontal Pod Autoscaler (HPA) + Cluster Autoscaler
- High Availability: Multi-zone pod distribution
- Managed: Google manages control plane, node upgrades
- Integration: Native GCP service integration (Cloud SQL, Pub/Sub)
Consequences
Positive:
- Zero-downtime deployments (rolling updates)
- Elastic capacity (auto-scale from 3 to 50 pods)
- Simplified operations (no manual node management)
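The scaling behavior follows the core Horizontal Pod Autoscaler formula, desired = ceil(current × currentMetric / targetMetric), clamped to the configured replica range. The sketch below implements only that formula; the real controller also applies a tolerance band and stabilization windows. The 60% CPU target in the example is an assumption, not a value from this ADR.

```typescript
// Core HPA calculation, clamped to [min, max] replicas.
function desiredReplicas(
  current: number,
  currentMetric: number,
  targetMetric: number,
  min: number,
  max: number,
): number {
  if (targetMetric <= 0) {
    throw new RangeError('targetMetric must be positive');
  }
  const raw = Math.ceil(current * (currentMetric / targetMetric));
  return Math.min(max, Math.max(min, raw));
}

// Backend pods (3-50 replicas) running at 90% CPU against an assumed 60% target:
const scaled = desiredReplicas(10, 90, 60, 3, 50); // 15
```

The clamp is what produces the 3-to-50 range quoted above: load spikes cannot push the backend past 50 pods, and idle periods cannot drop it below 3.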
Negative:
- Kubernetes complexity (steep learning curve)
- Higher cost than VMs (but offset by auto-scaling savings)
Mitigation:
- Use Backstage Kubernetes plugin for visibility
- Infrastructure-as-code (Terraform) for reproducibility
ADR-010: Managed Prometheus for Observability
Status: ✅ Accepted Date: 2024-11-13 Impact: Low (Observability)
Context
Need metrics collection for monitoring API latency, error rates, queue depth, cache hit rates.
Decision
Use Google Cloud Managed Service for Prometheus.
Rationale
Advantages:
- Managed: No self-hosted Prometheus cluster
- Scalability: Handles millions of metrics
- Integration: Native GKE integration (auto-discovery)
Consequences
Positive:
- Zero operational overhead
- High availability (99.9% SLA)
Negative:
- GCP vendor lock-in (vs. self-hosted Prometheus)
Mitigation:
- Use standard Prometheus metrics format (portable to other backends)
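For reference, the portable part is the Prometheus text exposition format itself. The sketch below renders a single counter by hand to show what that format looks like. In practice a client library would produce it; the metric name and the `renderCounter` helper are hypothetical.

```typescript
interface CounterSample {
  labels: Record<string, string>;
  value: number;
}

// Escapes backslash, double quote, and newline as the text format requires.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Renders one counter (HELP line, TYPE line, one line per sample).
function renderCounter(name: string, help: string, samples: CounterSample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const sample of samples) {
    const labels = Object.keys(sample.labels)
      .map((k) => `${k}="${escapeLabel(sample.labels[k] ?? '')}"`)
      .join(',');
    lines.push(labels ? `${name}{${labels}} ${sample.value}` : `${name} ${sample.value}`);
  }
  return lines.join('\n') + '\n';
}

const exposition = renderCounter('catalog_cache_hits_total', 'Cache hits.', [
  { labels: { tenant: 't1' }, value: 42 },
]);
```

Any backend that scrapes this format can consume the same `/metrics` output, which is what keeps the choice of managed service reversible.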
Decision Matrix
| Decision | Security Impact | Cost Impact | Complexity | Reversibility |
|---|---|---|---|---|
| ADR-001: RLS | ✅ High | ✅ Low | Medium | Hard |
| ADR-002: Pub/Sub | Medium | ✅ Low | Low | Medium |
| ADR-003: Webhooks | Medium | ✅ Low | Medium | Easy |
| ADR-004: In-Memory Sanitization | ✅ High | Medium | High | Hard |
| ADR-005: Partitioning | Low | ✅ Low | Medium | Hard |
| ADR-006: Redis Cache | Low | Medium | Low | Easy |
| ADR-007: GCP | Low | ✅ High | Medium | Hard |
| ADR-008: Node.js 20 | Low | ✅ Low | Low | Easy |
| ADR-009: GKE | Low | Medium | High | Hard |
| ADR-010: Managed Prometheus | Low | ✅ Low | Low | Easy |
Document Maintenance:
- Review ADRs quarterly
- Update status when decisions superseded
- Add new ADRs for significant architectural changes
Related Documents:
- Enterprise SaaS Plugin Architecture
- Component Diagrams
- Security Architecture (Pending)
Last Updated: 2024-11-13 Next Review: 2025-02-13