Architecture Decision Records (ADR) Summary

Multi-Tenant Backstage Terraform Cloud Plugin

Document Version: 1.0 Last Updated: 2024-11-13


Overview

This document summarizes all architectural decisions made for the multi-tenant Backstage plugin integrating with Terraform Cloud. Each ADR follows the format:

  1. Status: Accepted | Proposed | Deprecated | Superseded
  2. Context: Problem statement and constraints
  3. Decision: What was decided
  4. Rationale: Why this decision was made
  5. Consequences: Trade-offs and implications

ADR Index

| ID | Title | Status | Date | Impact |
| --- | --- | --- | --- | --- |
| ADR-001 | Row-Level Security for Tenant Isolation | Accepted | 2024-11-13 | High |
| ADR-002 | Pub/Sub for Asynchronous Processing | Accepted | 2024-11-13 | High |
| ADR-003 | Real-Time Sync via Terraform Cloud Webhooks | Accepted | 2024-11-13 | Medium |
| ADR-004 | In-Memory State Sanitization | Accepted | 2024-11-13 | High |
| ADR-005 | PostgreSQL Partitioning by Tenant | Accepted | 2024-11-13 | Medium |
| ADR-006 | Redis for Catalog Caching | Accepted | 2024-11-13 | Medium |
| ADR-007 | Google Cloud Platform as Primary Provider | Accepted | 2024-11-13 | High |
| ADR-008 | Node.js 20 LTS for Backend Runtime | Accepted | 2024-11-13 | Low |
| ADR-009 | Kubernetes (GKE) for Container Orchestration | Accepted | 2024-11-13 | High |
| ADR-010 | Managed Prometheus for Observability | Accepted | 2024-11-13 | Low |

ADR-001: Row-Level Security for Tenant Isolation

Status: ✅ Accepted Date: 2024-11-13 Impact: High (Security-critical)

Context

Multi-tenant SaaS application requires strict data isolation between enterprise clients. Each tenant must only access their own data, with zero risk of cross-tenant data leaks. Regulatory compliance (SOC 2, GDPR) requires database-enforced isolation, not just application-layer checks.

Options Considered:

  1. Separate Databases per Tenant: Maximum isolation, but high operational cost (100+ databases)
  2. Separate Schemas per Tenant: Good isolation, but complex migrations
  3. Row-Level Security (RLS): Database-enforced isolation in single database
  4. Discriminator Column Only: Application-layer filtering (insufficient security)

Decision

Implement PostgreSQL Row-Level Security (RLS) with tenant_id discriminator column.

Rationale

Advantages:

  • Security: Database enforces isolation via SQL policies (not application code)
  • Cost: Single database instance reduces operational overhead by 95%
  • Performance: Indexed tenant_id column + partitioning maintains query speed
  • Compliance: Meets SOC 2 Type II data isolation requirements
  • Simplicity: Single backup/restore process, no complex multi-DB orchestration

Implementation:

-- Enable RLS on catalog_entities table
ALTER TABLE catalog_entities ENABLE ROW LEVEL SECURITY;
-- Apply policies to the table owner as well (owners bypass RLS by default)
ALTER TABLE catalog_entities FORCE ROW LEVEL SECURITY;

-- Policy: Users can only see their tenant's data
CREATE POLICY tenant_isolation_policy ON catalog_entities
  USING (tenant_id = current_setting('app.current_tenant')::UUID);

Tenant Context Injection:

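// Middleware sets the PostgreSQL session variable for the current transaction.
// SET does not accept bind parameters, so use set_config(); the third argument
// (is_local = true) scopes the value to the enclosing transaction, like SET LOCAL.
await db.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);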
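
The tenant context only takes effect inside a transaction, and the tenant ID should be validated before it reaches the database. The following is a minimal sketch of that ordering, not the plugin's actual middleware: `Statement`, `tenantContextStatement`, and `tenantScopedTransaction` are hypothetical names, and the `{ text, values }` shape is assumed to match what the database client accepts (node-postgres takes query objects of this form).

```typescript
// Hypothetical statement shape; adjust to the database client in use.
interface Statement {
  text: string;
  values: string[];
}

const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Builds the statement that sets app.current_tenant for the current transaction.
// set_config(name, value, is_local) accepts bind parameters, unlike SET.
function tenantContextStatement(tenantId: string): Statement {
  if (!UUID_PATTERN.test(tenantId)) {
    throw new Error("tenantId must be a UUID");
  }
  return {
    text: "SELECT set_config('app.current_tenant', $1, true)",
    values: [tenantId],
  };
}

// Wraps the caller's statements so the tenant context is set first,
// inside the same transaction that runs the tenant's queries.
function tenantScopedTransaction(tenantId: string, work: Statement[]): Statement[] {
  const begin: Statement = { text: "BEGIN", values: [] };
  const commit: Statement = { text: "COMMIT", values: [] };
  return [begin, tenantContextStatement(tenantId)].concat(work, [commit]);
}
```

Each statement in the returned list would be executed in order on a single connection, so a pooled client must not hand the statements to different connections.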

Consequences

Positive:

  • Single database simplifies backups, migrations, and disaster recovery
  • Lower infrastructure cost: $800/month vs. $80,000/month (100 databases)
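  • Automatic enforcement: a missing WHERE clause in application code cannot expose another tenant's rows, provided the application role is not a superuser, the table owner, or a BYPASSRLS role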

Negative:

  • PostgreSQL version dependency (requires 9.5+, already met)
  • RLS policy complexity increases with authorization logic
  • Slight performance overhead (~5-10ms per query)

Mitigation:

  • Comprehensive RLS policy testing in CI/CD
  • Automated leak detection (hourly job checks for cross-tenant entities)
  • Regular security audits of RLS policies

Monitoring:

-- Detect cross-tenant leaks (alert if any results).
-- Run as a role that bypasses RLS; otherwise the policy hides other tenants' rows.
SELECT entity_ref, COUNT(DISTINCT tenant_id) as tenant_count
FROM catalog_entities
GROUP BY entity_ref
HAVING COUNT(DISTINCT tenant_id) > 1;

ADR-002: Pub/Sub for Asynchronous Processing

Status: ✅ Accepted Date: 2024-11-13 Impact: High (Scalability-critical)

Context

System must handle burst traffic from Terraform Cloud webhooks (1000+ workspace updates within minutes). Synchronous processing would block API requests and cause timeouts. Need decoupling between webhook ingestion and state processing.

Options Considered:

  1. Google Cloud Pub/Sub: Managed, serverless, auto-scaling
  2. RabbitMQ (self-hosted): Full control, complex operations
  3. Apache Kafka (self-hosted): High throughput, heavy operational overhead
  4. Database as Queue (SKIP LOCKED): Simple, but doesn't scale

Decision

Use Google Cloud Pub/Sub for all asynchronous task distribution.

Rationale

Advantages:

  • Scalability: Managed service scales to millions of messages/sec automatically
  • Reliability: At-least-once delivery guarantee with dead letter queues
  • Integration: Native GCP integration (IAM auth, Cloud Monitoring)
  • Cost: Pay-per-use ($40/million messages vs. $500/month for RabbitMQ VMs)
  • Operations: Zero maintenance (no cluster management, no broker failures)

Message Flow:

Topics:

  • terraform-workspace-discovered - New workspace found via GitHub scan
  • terraform-state-updated - State version changed (webhook trigger)
  • sanitization-failed - Dead letter queue for manual review

Consequences

Positive:

  • Decoupled architecture: Webhook handlers respond in < 100ms
  • Elastic capacity: Automatically handles burst traffic (1000 msg/sec)
  • Retry logic: Exponential backoff (10s → 600s) for transient failures
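
To make the 10s → 600s retry window concrete, here is a small sketch of a capped exponential schedule. The doubling factor and the `backoffSeconds` helper are assumptions for illustration; Pub/Sub only lets you configure the minimum and maximum backoff and does not promise this exact curve.

```typescript
// Capped exponential backoff: minSeconds * 2^attempt, never above maxSeconds.
// attempt is zero-based (0 = first retry).
function backoffSeconds(attempt: number, minSeconds: number = 10, maxSeconds: number = 600): number {
  if (attempt < 0 || Math.floor(attempt) !== attempt) {
    throw new RangeError("attempt must be a non-negative integer");
  }
  return Math.min(maxSeconds, minSeconds * Math.pow(2, attempt));
}

// Delays for the first eight retries under these assumptions.
const retrySchedule: number[] = [];
for (let attempt = 0; attempt < 8; attempt++) {
  retrySchedule.push(backoffSeconds(attempt));
}
console.log(retrySchedule.join(", "));
```

Under these assumptions the delay reaches the 600-second cap on the seventh retry and stays there.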

Negative:

  • At-least-once delivery: Workers must be idempotent (duplicate messages possible)
  • GCP vendor lock-in: Migrating to Kafka/RabbitMQ requires code changes
  • Cost at extreme scale: $40/million messages (10M messages = $400/month)

Mitigation:

  • Idempotent message handlers (check entity updated_at timestamp)
  • Multi-cloud abstraction layer (future: support AWS SQS, Azure Service Bus)
  • Cost monitoring: Alert if message volume exceeds 5M/day

Idempotent Handler Example:

async function handleStateUpdate(message: StateUpdateMessage) {
  const { workspace_id, state_version } = message;

  // Check if already processed
  const existing = await db.query(
    'SELECT state_version FROM catalog_entities WHERE workspace_id = $1',
    [workspace_id]
  );

  // Assumes a node-postgres style result: the row, if any, is in rows[0]
  if (existing.rows[0]?.state_version === state_version) {
    console.log('Duplicate message, skipping');
    return; // Already processed
  }

  // Process state update...
}
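
An equality check skips exact duplicates, but at-least-once delivery can also reorder messages, so an older update may arrive after a newer one. A possible refinement, sketched here under the assumption that each message carries a monotonically increasing number (the `serial` name is hypothetical; the handler above only shows `state_version`), is to apply an update only when it is strictly newer than what is stored.

```typescript
// Returns true only when the incoming update is newer than the stored one.
// storedSerial is undefined when the workspace has never been processed.
function shouldApply(storedSerial: number | undefined, incomingSerial: number): boolean {
  return storedSerial === undefined || incomingSerial > storedSerial;
}

console.log(shouldApply(undefined, 3)); // first message for a workspace
console.log(shouldApply(3, 3)); // exact duplicate
console.log(shouldApply(5, 3)); // stale message delivered late
```

With this rule both duplicates and late deliveries become no-ops, so the handler stays idempotent regardless of delivery order.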

ADR-003: Real-Time Sync via Terraform Cloud Webhooks

Status: ✅ Accepted Date: 2024-11-13 Impact: Medium (User Experience)

Context

Users expect infrastructure changes to appear in the Backstage catalog within 5 minutes of terraform apply. Polling Terraform Cloud every 5 minutes (12 polls/hour for every workspace) would require about 20,000 API calls/hour, hitting rate limits.

Options Considered:

  1. Webhooks (Event-Driven): Near-instant updates, low API usage
  2. Polling (Every 5 minutes): Simple, but high API cost and latency
  3. Hybrid (Webhooks + Hourly Poll): Best of both worlds

Decision

Use Terraform Cloud webhooks as primary sync mechanism, with hourly polling as fallback.

Rationale

Advantages:

  • Latency: Webhook delivers event in < 30 seconds (vs. 2.5 minutes average for polling)
  • Efficiency: Event-driven reduces API calls by 95% (20K → 1K calls/hour)
  • Cost: Lower Terraform Cloud API usage (rate limit budget preserved)
  • User Experience: "Live" catalog updates instead of stale data

Webhook Configuration:

# Per-tenant Terraform Cloud webhook
webhooks:
  - name: "backstage-catalog-sync"
    url: "https://plugin.backstage.example.com/api/webhooks/terraform-cloud"
    token: "<hmac-secret>"
    events:
      - "run:completed"
      - "run:errored"
      - "workspace:created"
      - "workspace:deleted"

Security:

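  • HMAC signature verification (Terraform Cloud signs notification payloads with HMAC-SHA512 and sends the digest in the X-TFE-Notification-Signature header)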
  • Timestamp validation (reject if > 5 minutes old)
  • IP allowlist (Terraform Cloud egress IPs)
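
The first two checks in this list can be sketched with Node's built-in crypto module. This is an illustrative outline, not the plugin's handler: `verifySignature` and `isFresh` are hypothetical names, and the digest algorithm is a parameter because it has to match what the sender uses (Terraform Cloud's notification documentation describes SHA-512; confirm against the current docs).

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type HmacAlgorithm = "sha256" | "sha512";

// Recomputes the HMAC over the raw request body and compares it to the
// hex signature from the request header in constant time.
function verifySignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
  algorithm: HmacAlgorithm
): boolean {
  const expected = createHmac(algorithm, secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws when lengths differ, so check the length first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Rejects payloads whose timestamp is unparseable or more than maxAgeMs away
// from the current time (in either direction, to bound clock skew as well).
function isFresh(sentAtIso: string, nowMs: number, maxAgeMs: number = 5 * 60 * 1000): boolean {
  const sentAtMs = Date.parse(sentAtIso);
  if (isNaN(sentAtMs)) {
    return false;
  }
  return Math.abs(nowMs - sentAtMs) <= maxAgeMs;
}
```

The signature must be computed over the exact bytes received, so the handler needs access to the raw body before any JSON parsing middleware rewrites it.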

Consequences

Positive:

  • Near-real-time catalog updates (< 5 minutes → < 30 seconds)
  • 95% reduction in Terraform Cloud API calls
  • Better user experience (perceived "live" updates)

Negative:

  • Webhook reliability dependency: If Terraform Cloud webhook delivery fails, the catalog goes stale until the next fallback sync
  • HMAC verification overhead: Every webhook request requires signature check
  • Replay attack surface: Must track nonces to prevent replay

Mitigation:

  • Fallback polling: Hourly full sync job catches missed webhooks
  • Webhook retry logic: Terraform Cloud retries up to 3 times with exponential backoff
  • Monitoring: Alert if no webhooks received for > 2 hours
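
The replay concern listed under the negatives can be handled by remembering delivery identifiers for as long as a payload is considered fresh. The sketch below keeps them in process memory, which is an assumption made for brevity; with several webhook handler replicas the same idea would need a shared store. `ReplayGuard` and `deliveryId` are hypothetical names.

```typescript
// Remembers delivery ids for windowMs and rejects any id seen again in that window.
class ReplayGuard {
  private seenAt: Record<string, number> = Object.create(null);
  private windowMs: number;

  constructor(windowMs: number = 5 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  // Returns true the first time an id is seen within the window, false for a replay.
  accept(deliveryId: string, nowMs: number): boolean {
    this.evictExpired(nowMs);
    if (deliveryId in this.seenAt) {
      return false;
    }
    this.seenAt[deliveryId] = nowMs;
    return true;
  }

  // Drops ids older than the window so memory use stays bounded.
  private evictExpired(nowMs: number): void {
    for (const id of Object.keys(this.seenAt)) {
      if (nowMs - this.seenAt[id] > this.windowMs) {
        delete this.seenAt[id];
      }
    }
  }
}
```

Matching the window to the 5-minute timestamp check means an identifier can be forgotten exactly when the payload it belongs to would be rejected as too old anyway.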

Fallback Polling Job:

// Run every hour as safety net
setInterval(async () => {
  console.log('Starting hourly full sync (fallback)...');

  try {
    for (const tenant of tenants) {
      const workspaces = await terraformCloud.listWorkspaces(tenant.org);

      for (const workspace of workspaces) {
        const lastSyncTime = await getLastSyncTime(workspace.id);
        const hoursSinceSync = (Date.now() - lastSyncTime) / (1000 * 60 * 60);

        if (hoursSinceSync > 1) {
          await enqueueStateUpdate(workspace.id); // Missed webhook
        }
      }
    }
  } catch (err) {
    // Log and wait for the next run; an unhandled rejection here could crash the process
    console.error('Hourly full sync failed', err);
  }
}, 3600000); // 1 hour

ADR-004: In-Memory State Sanitization

Status: ✅ Accepted Date: 2024-11-13 Impact: High (Security-critical)

Context

Terraform state files contain sensitive data:

  • API keys, passwords, database credentials
  • Service account keys (GCP, AWS)
  • Private IP addresses and internal hostnames
  • Personally Identifiable Information (PII)

This data must never be persisted to disk or reach the Backstage catalog. Regulatory compliance (GDPR, CCPA, PCI-DSS) requires demonstrable sanitization.

Options Considered:

  1. In-Memory Sanitization: Process state in RAM, never write unredacted to disk
  2. Temporary File Sanitization: Write to disk, sanitize, then delete
  3. Database-Side Sanitization: Store raw state, sanitize on read (❌ unacceptable)

Decision

Sanitize Terraform state in-memory only, never persisting unredacted state to disk or database.

Rationale

Advantages:

  • Security: No plaintext secrets on disk (reduces attack surface)
  • Compliance: GDPR/CCPA "right to be forgotten" - no persistent PII
  • Performance: In-memory processing 3-5x faster than disk I/O
  • Auditability: All redactions logged before persistence

Sanitization Process:

// Worker receives state JSON from Terraform Cloud
const stateJson = await fetch(stateDownloadUrl).then(r => r.json());

// Load tenant-specific sanitization rules
const sanitizer = new StateSanitizer(tenantId);

// Sanitize in-memory (never write unredacted state to disk)
const { sanitized, violations } = await sanitizer.sanitize(stateJson);

// Log all redactions for audit (paths and categories only, never the redacted values)
await auditLog.record({
  workspace_id: workspace.id,
  violations: violations, // [{ path: "resource.password", category: "credential" }]
  pii_count: violations.filter(v => v.category === 'pii').length,
  credential_count: violations.filter(v => v.category === 'credential').length
});

// Persist ONLY sanitized state
await db.query('INSERT INTO catalog_entities ...', [sanitized]);

Sanitization Rules:

const defaultRules: SanitizationRule[] = [
  {
    id: 'strip-passwords',
    pattern: /(password|passwd|pwd)\s*[:=]\s*["']?([^"'\s]+)/i,
    action: 'redact',
    category: 'credential'
  },
  {
    id: 'strip-api-keys',
    pattern: /[a-zA-Z0-9]{32,}/, // High-entropy strings
    min_entropy: 4.5,
    action: 'redact',
    category: 'credential'
  },
  {
    id: 'strip-emails',
    pattern: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/,
    action: 'redact',
    category: 'pii'
  }
];
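
How rules like these might be applied is sketched below. This is an assumption-laden illustration rather than the real StateSanitizer, which this document does not show: `Rule`, `Violation`, `shannonEntropy`, and `sanitizeNode` are hypothetical names, the entropy threshold is interpreted as Shannon entropy in bits per character, and patterns are applied to string leaf values only (redacting by key name would need extra handling).

```typescript
type Category = "credential" | "pii";

interface Rule {
  id: string;
  pattern: RegExp; // must not use the "g" flag, which would make exec() stateful
  category: Category;
  minEntropy?: number;
}

interface Violation {
  path: string;
  rule: string;
  category: Category;
}

// Shannon entropy of a string in bits per character.
function shannonEntropy(text: string): number {
  const counts: Record<string, number> = Object.create(null);
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    counts[ch] = (counts[ch] || 0) + 1;
  }
  let bits = 0;
  for (const ch of Object.keys(counts)) {
    const p = counts[ch] / text.length;
    bits -= p * (Math.log(p) / Math.LN2);
  }
  return bits;
}

// A rule matches when its pattern matches and, if minEntropy is set,
// the matched text is at least that random-looking.
function ruleMatches(rule: Rule, value: string): boolean {
  const match = rule.pattern.exec(value);
  if (match === null) {
    return false;
  }
  return rule.minEntropy === undefined || shannonEntropy(match[0]) >= rule.minEntropy;
}

// Returns a redacted deep copy of node and appends one violation per redacted value.
function sanitizeNode(node: unknown, rules: Rule[], path: string, violations: Violation[]): unknown {
  if (typeof node === "string") {
    for (const rule of rules) {
      if (ruleMatches(rule, node)) {
        violations.push({ path: path, rule: rule.id, category: rule.category });
        return "[REDACTED]";
      }
    }
    return node;
  }
  if (Array.isArray(node)) {
    return node.map((item, i) => sanitizeNode(item, rules, path + "[" + i + "]", violations));
  }
  if (node !== null && typeof node === "object") {
    const source = node as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = sanitizeNode(source[key], rules, path ? path + "." + key : key, violations);
    }
    return out;
  }
  return node; // numbers, booleans, null pass through unchanged
}

const sketchRules: Rule[] = [
  { id: "strip-api-keys", pattern: /[a-zA-Z0-9]{32,}/, category: "credential", minEntropy: 4.5 },
  { id: "strip-emails", pattern: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/, category: "pii" },
];
```

The violation records carry only the path, rule, and category, so the audit log never stores the value that was redacted.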

Consequences

Positive:

  • Stronger security posture (no unredacted state persisted)
  • Compliance-ready (GDPR Article 32: "pseudonymisation and encryption")
  • Faster processing (no disk I/O for temporary files)

Negative:

  • Higher memory requirements: Each worker needs 2-4GB RAM (state files 100-500MB)
  • Complex sanitization logic: Regex rules require tuning to avoid false positives
  • Loss of raw state: Cannot "undo" over-aggressive redactions

Mitigation:

  • Worker pod memory limits: 4GB per pod
  • Configurable allowlists: Tenants can mark specific patterns as "safe"
  • Audit trail: Log all redactions for manual review
  • Dead letter queue: States with excessive violations (> 100) go to manual review

Memory Management:

# Kubernetes resource limits
resources:
  requests:
    memory: 2Gi
    cpu: 1000m
  limits:
    memory: 4Gi
    cpu: 2000m

ADR-005: PostgreSQL Partitioning by Tenant

Status: ✅ Accepted Date: 2024-11-13 Impact: Medium (Performance Optimization)

Context

At 100 clients with 1000 entities each (100,000 total entities), full table scans become slow. Query latency exceeds 500ms for SELECT * FROM catalog_entities WHERE tenant_id = ? without partitioning.

Target Performance:

  • < 100ms p95 query latency at 100K entities
  • < 200ms p95 query latency at 1M entities

Decision

Implement hash partitioning by tenant_id with 10 partitions.

Rationale

Advantages:

  • Query Performance: Partition pruning reduces scanned rows by 90%
  • Index Efficiency: Smaller indexes per partition (10x faster lookups)
  • Maintenance: Partition-level VACUUM and ANALYZE
  • Scalability: Add partitions as tenant count grows

Partitioning Strategy:

-- Create partitioned table
CREATE TABLE catalog_entities (
  entity_id UUID DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  entity_ref VARCHAR(500) NOT NULL,
  metadata JSONB NOT NULL,
  ...
) PARTITION BY HASH (tenant_id);

-- Create 10 partitions (adjust based on tenant count)
CREATE TABLE catalog_entities_p0 PARTITION OF catalog_entities
  FOR VALUES WITH (MODULUS 10, REMAINDER 0);
CREATE TABLE catalog_entities_p1 PARTITION OF catalog_entities
  FOR VALUES WITH (MODULUS 10, REMAINDER 1);
-- ... create p2 through p9

-- Index on each partition
CREATE INDEX idx_entities_p0_ref ON catalog_entities_p0(entity_ref);
-- ... create indexes on p1 through p9
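
Writing ten near-identical statements by hand is error-prone, so the partition and index DDL could be generated instead. The helper below is a hypothetical sketch (`hashPartitionDdl` is not part of the plugin) that follows the naming used in the statements above and emits one partition plus one index per remainder.

```typescript
// Generates CREATE TABLE ... PARTITION OF and CREATE INDEX statements
// for every remainder of a hash-partitioned table.
function hashPartitionDdl(table: string, modulus: number): string[] {
  if (!/^[a-z_][a-z0-9_]*$/.test(table)) {
    throw new Error("table must be a plain lowercase identifier");
  }
  if (modulus < 1 || Math.floor(modulus) !== modulus) {
    throw new RangeError("modulus must be a positive integer");
  }
  const statements: string[] = [];
  for (let remainder = 0; remainder < modulus; remainder++) {
    const partition = `${table}_p${remainder}`;
    statements.push(
      `CREATE TABLE ${partition} PARTITION OF ${table}\n` +
        `  FOR VALUES WITH (MODULUS ${modulus}, REMAINDER ${remainder});`
    );
    statements.push(`CREATE INDEX idx_entities_p${remainder}_ref ON ${partition}(entity_ref);`);
  }
  return statements;
}

// Print the DDL for the 10-partition layout chosen in this ADR.
console.log(hashPartitionDdl("catalog_entities", 10).join("\n"));
```

Generating the statements from one modulus value also keeps the partition list and the index list from drifting apart when the count changes.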

Query Plan (with partitioning):

EXPLAIN SELECT * FROM catalog_entities WHERE tenant_id = '123e4567-e89b-12d3-a456-426614174000';

-- Result: Partition pruning activated, only 1 partition scanned
Seq Scan on catalog_entities_p3 (rows=1000, actual=85ms)
Filter: (tenant_id = '123e4567-e89b-12d3-a456-426614174000')

Consequences

Positive:

  • 10x query performance improvement (500ms → 50ms)
  • Better cache utilization (smaller working set)
  • Partial isolation of "hot" tenants: a large tenant only affects the partition it hashes to, although hash partitioning does not guarantee it a partition of its own

Negative:

  • Schema complexity (10 partition tables instead of 1)
  • Migration overhead (partitioning existing data requires downtime)
  • Cross-partition queries slower (rare in multi-tenant app)

Mitigation:

  • Partition count configurable (start with 10, increase to 50 at 500+ tenants)
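  • Repartitioning (hash partitioning requires PostgreSQL 11+): Hash partitions cannot simply be appended; raising the count means splitting existing partitions under a larger modulus and moving rows, so schedule it as a planned migration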
  • Automated partition management: Add new partitions when tenant count grows

ADR-006: Redis for Catalog Caching

Status: ✅ Accepted Date: 2024-11-13 Impact: Medium (Performance Optimization)

Context

Backstage UI makes frequent catalog queries (10-50 requests/page load). Database queries for the same entity (e.g., "list all components") are repetitive and slow (100-200ms). Target: < 50ms p95 API latency.

Decision

Implement Redis caching with cache-aside pattern and 5-minute TTL.

Rationale

Advantages:

  • Latency: Cache hits respond in 5-10ms (vs. 100-200ms database query)
  • Database Load: Offload 70-80% of reads to cache
  • Cost: Redis cheaper than additional database read replicas

Cache Strategy:

// Cache-aside pattern
async function getCatalogEntity(tenantId: string, entityRef: string): Promise<Entity | null> {
  const cacheKey = `entity:${tenantId}:${entityRef}`;

  // Try cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached); // Cache hit (5ms)
  }

  // Cache miss: query database (assumes a node-postgres style result with rows)
  const result = await db.query(
    'SELECT * FROM catalog_entities WHERE tenant_id = $1 AND entity_ref = $2',
    [tenantId, entityRef]
  );
  const entity: Entity | null = result.rows[0] ?? null;
  if (entity === null) {
    return null; // Not found: do not cache a missing entity
  }

  // Store in cache with TTL
  await redis.setex(cacheKey, 300, JSON.stringify(entity)); // 5 minutes

  return entity;
}

Cache Invalidation:

  • Write-through: Update cache on every database write
  • TTL-based: Expire after 5 minutes (eventual consistency acceptable)
  • Manual invalidation: Webhook events trigger cache delete
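
How TTL expiry and explicit invalidation interact can be shown with a small in-memory stand-in for Redis. Everything here is a simplified assumption for illustration: `TtlCache`, `entityCacheKey`, and `getOrLoad` are hypothetical names, and time is passed in explicitly so the behaviour is easy to follow.

```typescript
// Minimal in-memory cache with per-entry expiry, standing in for Redis.
class TtlCache<V> {
  private entries: Record<string, { value: V; expiresAt: number }> = Object.create(null);
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, nowMs: number): V | undefined {
    const entry = this.entries[key];
    if (entry === undefined || nowMs >= entry.expiresAt) {
      delete this.entries[key]; // expired or absent
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, nowMs: number): void {
    this.entries[key] = { value: value, expiresAt: nowMs + this.ttlMs };
  }

  // Called when a webhook reports a change, so the next read reloads from the database.
  invalidate(key: string): void {
    delete this.entries[key];
  }
}

// Same key layout as the cache-aside function: tenant first, so keys never collide across tenants.
function entityCacheKey(tenantId: string, entityRef: string): string {
  return `entity:${tenantId}:${entityRef}`;
}

// Cache-aside read: return the cached value, or load it and cache the result.
function getOrLoad<V>(cache: TtlCache<V>, key: string, nowMs: number, load: () => V): V {
  const hit = cache.get(key, nowMs);
  if (hit !== undefined) {
    return hit;
  }
  const loaded = load();
  cache.set(key, loaded, nowMs);
  return loaded;
}
```

A webhook handler would call `invalidate(entityCacheKey(tenantId, entityRef))` after a change, while the TTL bounds staleness for any change whose webhook was missed.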

Consequences

Positive:

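  • A 70% cache hit rate brings mean API latency to roughly 50ms (0.7 × 5-10ms + 0.3 × 100-200ms); the < 50ms p95 target needs a hit rate near 95% or a faster miss path, because at 70% the 95th-percentile request is still a cache miss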
  • Database CPU usage reduced by 60%
  • Better user experience (faster page loads)
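
The relationship between hit rate and latency can be checked with a weighted average. The sketch below uses midpoints of the ranges quoted in this ADR (7.5ms for a hit, 150ms for a miss); the function names are hypothetical and the model ignores queueing and network variance.

```typescript
// Mean latency as a weighted average of hit and miss latencies.
function meanLatencyMs(hitRate: number, hitMs: number, missMs: number): number {
  if (hitRate < 0 || hitRate > 1) {
    throw new RangeError("hitRate must be between 0 and 1");
  }
  return hitRate * hitMs + (1 - hitRate) * missMs;
}

// A request at the given percentile (for example 0.95) is served by a cache miss
// whenever the hit rate is below that percentile.
function percentileIsCacheMiss(hitRate: number, percentile: number): boolean {
  return hitRate < percentile;
}

console.log(meanLatencyMs(0.7, 7.5, 150).toFixed(2)); // mean at a 70% hit rate
console.log(percentileIsCacheMiss(0.7, 0.95)); // is p95 still a miss at 70%?
```

This is why the 60% hit-rate alert in the mitigation list matters: the mean degrades linearly as hits fall, while tail latency is governed by the miss path.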

Negative:

  • Stale data: 5-minute TTL means catalog may be outdated
  • Memory cost: Redis cluster ($200/month for 10GB)
  • Complexity: Cache invalidation logic required

Mitigation:

  • Webhook-driven invalidation for critical entities
  • Monitoring: Alert if cache hit rate < 60%
  • Redis Cluster for high availability

ADR-007: Google Cloud Platform as Primary Provider

Status: ✅ Accepted Date: 2024-11-13 Impact: High (Infrastructure)

Context

Need to select cloud provider for hosting multi-tenant Backstage plugin. Requirements:

  • Managed Kubernetes (GKE, EKS, AKS)
  • Managed PostgreSQL with high availability
  • Managed message queue (Pub/Sub, SQS, Service Bus)
  • Secret management with automatic rotation

Decision

Deploy on Google Cloud Platform (GCP) with multi-cloud abstraction layer.

Rationale

Advantages:

  • GKE: Best-in-class Kubernetes (autopilot mode, auto-scaling)
  • Cloud SQL: Managed PostgreSQL with RLS support, CMEK encryption
  • Pub/Sub: Serverless message queue (vs. self-hosted RabbitMQ)
  • Integration: Native IAM integration (no service account keys)
  • Cost: 20% cheaper than AWS for equivalent workload

Cost Comparison (100 clients):

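| Service | GCP (USD/month) | AWS (USD/month) | Difference |
| --- | --- | --- | --- |
| Kubernetes | $2,500 | $3,200 | -$700 |
| PostgreSQL HA | $800 | $1,100 | -$300 |
| Message Queue | $400 | $500 | -$100 |
| Total | $4,000 | $5,100 | -$1,100 |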

Consequences

Positive:

  • Lower operational cost (20% savings)
  • Better Kubernetes experience (GKE autopilot)
  • Managed services reduce maintenance burden

Negative:

  • GCP vendor lock-in: Migrating to AWS requires re-architecture
  • Learning curve: Team must learn GCP-specific services

Mitigation:

  • Multi-cloud abstraction layer (support AWS in future)
  • Infrastructure-as-code (Terraform) for portability
  • Avoid GCP-specific features (use standard Kubernetes APIs)

ADR-008: Node.js 20 LTS for Backend Runtime

Status: ✅ Accepted Date: 2024-11-13 Impact: Low (Technology Choice)

Context

Backstage is built on Node.js/TypeScript. Need to select runtime version for backend plugin.

Decision

Use Node.js 20 LTS (Long-Term Support).

Rationale

Advantages:

  • Compatibility: Backstage officially supports Node.js 18-20
  • Performance: 10-20% faster than Node.js 16
  • Security: LTS receives security patches until April 2026
  • Modern Features: Native fetch API, test runner

Consequences

Positive:

  • Faster API response times (native performance improvements)
  • Better developer experience (modern JavaScript features)
  • Long-term support (no forced upgrade until security patches end in April 2026)

Negative:

  • Must upgrade from Node.js 16 (some dependencies may break)

Mitigation:

  • Test all dependencies with Node.js 20 before deployment
  • Automated CI/CD tests on Node.js 20

ADR-009: Kubernetes (GKE) for Container Orchestration

Status: ✅ Accepted Date: 2024-11-13 Impact: High (Infrastructure)

Context

Need container orchestration for scaling backend pods (3-50 replicas), worker pods (5-100 replicas), and webhook handlers (2-20 replicas).

Decision

Deploy on Google Kubernetes Engine (GKE) with Autopilot mode.

Rationale

Advantages:

  • Auto-Scaling: Horizontal Pod Autoscaler (HPA) + Cluster Autoscaler
  • High Availability: Multi-zone pod distribution
  • Managed: Google manages control plane, node upgrades
  • Integration: Native GCP service integration (Cloud SQL, Pub/Sub)
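
The Horizontal Pod Autoscaler's core rule is documented by Kubernetes as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), bounded by the configured minimum and maximum. The sketch below applies that rule to the backend's 3-50 replica range; it omits the tolerance band and stabilization windows that the real controller also applies, and `desiredReplicas` is an illustrative name.

```typescript
// Core HPA scaling rule, clamped to the configured replica range.
function desiredReplicas(
  currentReplicas: number,
  currentMetric: number,
  targetMetric: number,
  minReplicas: number,
  maxReplicas: number
): number {
  if (targetMetric <= 0) {
    throw new RangeError("targetMetric must be positive");
  }
  const raw = Math.ceil(currentReplicas * (currentMetric / targetMetric));
  return Math.max(minReplicas, Math.min(maxReplicas, raw));
}

// Backend pods at 90% average CPU against a 60% target, starting from 3 replicas.
console.log(desiredReplicas(3, 90, 60, 3, 50));
```

The clamp is what keeps a traffic burst from scaling past the 50-replica ceiling, at which point the Cluster Autoscaler's node capacity is no longer the limiting factor.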

Consequences

Positive:

  • Zero-downtime deployments (rolling updates)
  • Elastic capacity (auto-scale from 3 to 50 pods)
  • Simplified operations (no manual node management)

Negative:

  • Kubernetes complexity (steep learning curve)
  • Higher cost than VMs (but offset by auto-scaling savings)

Mitigation:

  • Use Backstage Kubernetes plugin for visibility
  • Infrastructure-as-code (Terraform) for reproducibility

ADR-010: Managed Prometheus for Observability

Status: ✅ Accepted Date: 2024-11-13 Impact: Low (Observability)

Context

Need metrics collection for monitoring API latency, error rates, queue depth, cache hit rates.

Decision

Use Google Cloud Managed Service for Prometheus.

Rationale

Advantages:

  • Managed: No self-hosted Prometheus cluster
  • Scalability: Handles millions of metrics
  • Integration: Native GKE integration (auto-discovery)

Consequences

Positive:

  • Zero operational overhead
  • High availability (99.9% SLA)

Negative:

  • GCP vendor lock-in (vs. self-hosted Prometheus)

Mitigation:

  • Use standard Prometheus metrics format (portable to other backends)
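
For reference, a sample in the Prometheus text exposition format is a metric name, an optional set of labels in braces, and a value. The sketch below renders one such line; the metric and label names are made up for illustration, and a real service would normally use a Prometheus client library rather than formatting lines by hand.

```typescript
// Escapes backslash, double quote, and newline as the text format requires for label values.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders one sample line: name{label="value",...} value
function renderSample(name: string, labels: Record<string, string>, value: number): string {
  const pairs = Object.keys(labels)
    .sort()
    .map((key) => `${key}="${escapeLabelValue(labels[key])}"`);
  const labelPart = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
  return `${name}${labelPart} ${value}`;
}

console.log(renderSample("catalog_cache_hit_ratio", { tenant: "acme" }, 0.72));
```

Because any Prometheus-compatible backend can scrape this format, the metrics themselves stay portable even though the managed service is GCP-specific.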

Decision Matrix

| Decision | Security Impact | Cost Impact | Complexity | Reversibility |
| --- | --- | --- | --- | --- |
| ADR-001: RLS | ✅ High | ✅ Low | Medium | Hard |
| ADR-002: Pub/Sub | Medium | ✅ Low | Low | Medium |
| ADR-003: Webhooks | Medium | ✅ Low | Medium | Easy |
| ADR-004: In-Memory Sanitization | ✅ High | Medium | High | Hard |
| ADR-005: Partitioning | Low | ✅ Low | Medium | Hard |
| ADR-006: Redis Cache | Low | Medium | Low | Easy |
| ADR-007: GCP | Low | ✅ High | Medium | Hard |
| ADR-008: Node.js 20 | Low | ✅ Low | Low | Easy |
| ADR-009: GKE | Low | Medium | High | Hard |
| ADR-010: Managed Prometheus | Low | ✅ Low | Low | Easy |

Document Maintenance:

  • Review ADRs quarterly
  • Update status when decisions superseded
  • Add new ADRs for significant architectural changes

Last Updated: 2024-11-13 Next Review: 2025-02-13