Architecture Decision Records (ADR) Summary

Multi-Tenant Backstage Terraform Cloud Plugin

Document Version: 1.0 Last Updated: 2024-11-13

Overview

This document summarizes all architectural decisions made for the multi-tenant Backstage plugin integrating with Terraform Cloud. Each ADR follows the format:

Status: Accepted | Proposed | Deprecated | Superseded
Context: Problem statement and constraints
Decision: What was decided
Rationale: Why this decision was made
Consequences: Trade-offs and implications

ADR Index

ID	Title	Status	Date	Impact
ADR-001	Row-Level Security for Tenant Isolation	Accepted	2024-11-13	High
ADR-002	Pub/Sub for Asynchronous Processing	Accepted	2024-11-13	High
ADR-003	Real-Time Sync via Terraform Cloud Webhooks	Accepted	2024-11-13	Medium
ADR-004	In-Memory State Sanitization	Accepted	2024-11-13	High
ADR-005	PostgreSQL Partitioning by Tenant	Accepted	2024-11-13	Medium
ADR-006	Redis for Catalog Caching	Accepted	2024-11-13	Medium
ADR-007	Google Cloud Platform as Primary Provider	Accepted	2024-11-13	High
ADR-008	Node.js 20 LTS for Backend Runtime	Accepted	2024-11-13	Low
ADR-009	Kubernetes (GKE) for Container Orchestration	Accepted	2024-11-13	High
ADR-010	Managed Prometheus for Observability	Accepted	2024-11-13	Low

ADR-001: Row-Level Security for Tenant Isolation

Status: ✅ Accepted Date: 2024-11-13 Impact: High (Security-critical)

Context

The multi-tenant SaaS application requires strict data isolation between enterprise clients. Each tenant must access only its own data, with no path for cross-tenant data leaks. Regulatory compliance (SOC 2, GDPR) requires database-enforced isolation, not just application-layer checks.

Options Considered:

Separate Databases per Tenant: Maximum isolation, but high operational cost (100+ databases)
Separate Schemas per Tenant: Good isolation, but complex migrations
Row-Level Security (RLS): Database-enforced isolation in single database
Discriminator Column Only: Application-layer filtering (insufficient security)

Decision

Implement PostgreSQL Row-Level Security (RLS) with tenant_id discriminator column.

Rationale

Advantages:

Security: Database enforces isolation via SQL policies (not application code)
Cost: Single database instance reduces operational overhead by 95%
Performance: Indexed tenant_id column + partitioning maintains query speed
Compliance: Meets SOC 2 Type II data isolation requirements
Simplicity: Single backup/restore process, no complex multi-DB orchestration

Implementation:

-- Enable RLS on catalog_entities table
ALTER TABLE catalog_entities ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS by default; FORCE applies the policies to them too
ALTER TABLE catalog_entities FORCE ROW LEVEL SECURITY;

-- Policy: Users can only see their tenant's data
CREATE POLICY tenant_isolation_policy ON catalog_entities
  USING (tenant_id = current_setting('app.current_tenant')::UUID);

Tenant Context Injection:

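// Middleware sets the PostgreSQL session variable inside the request's transaction.
// SET LOCAL does not accept bind parameters, so use set_config(); the third
// argument (true) makes the value transaction-local, like SET LOCAL.
await db.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);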
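A minimal sketch of how the tenant context might be scoped to one transaction, so the setting cannot leak to the next request that reuses a pooled connection. The Queryable interface and the withTenant helper are hypothetical names; they only assume a client with a query(text, params) method similar to common PostgreSQL drivers.

```typescript
// Hypothetical minimal client shape (assumed, similar to a PostgreSQL driver's client).
interface Queryable {
  query(text: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

// Runs `work` inside a transaction whose app.current_tenant is set to tenantId.
// set_config(name, value, true) is transaction-local, so the value is discarded
// on COMMIT or ROLLBACK.
async function withTenant<T>(
  client: Queryable,
  tenantId: string,
  work: (client: Queryable) => Promise<T>,
): Promise<T> {
  await client.query('BEGIN');
  try {
    await client.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

Every tenant-scoped query would then run inside the callback passed to withTenant, using the same client that opened the transaction.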

Consequences

Positive:

Single database simplifies backups, migrations, and disaster recovery
Lower infrastructure cost: $800/month vs. $80,000/month (100 databases)
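Automatic enforcement: a query that omits its tenant filter still returns only the current tenant's rows (superusers and roles with BYPASSRLS are exempt, so the application must not connect as one)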

Negative:

PostgreSQL version dependency (requires 9.5+, already met)
RLS policy complexity increases with authorization logic
Slight performance overhead (~5-10ms per query)

Mitigation:

Comprehensive RLS policy testing in CI/CD
Automated leak detection (hourly job checks for cross-tenant entities)
Regular security audits of RLS policies

Monitoring:

-- Detect cross-tenant leaks (alert if any results)
-- Run as a dedicated role with BYPASSRLS; under RLS an ordinary role sees one tenant only
SELECT entity_ref, COUNT(DISTINCT tenant_id) as tenant_count
FROM catalog_entities
GROUP BY entity_ref
HAVING COUNT(DISTINCT tenant_id) > 1;

ADR-002: Pub/Sub for Asynchronous Processing

Status: ✅ Accepted Date: 2024-11-13 Impact: High (Scalability-critical)

Context

System must handle burst traffic from Terraform Cloud webhooks (1000+ workspace updates within minutes). Synchronous processing would block API requests and cause timeouts. Need decoupling between webhook ingestion and state processing.

Options Considered:

Google Cloud Pub/Sub: Managed, serverless, auto-scaling
RabbitMQ (self-hosted): Full control, complex operations
Apache Kafka (self-hosted): High throughput, heavy operational overhead
Database as Queue (SKIP LOCKED): Simple, but doesn't scale

Decision

Use Google Cloud Pub/Sub for all asynchronous task distribution.

Rationale

Advantages:

Scalability: Managed service scales to millions of messages/sec automatically
Reliability: At-least-once delivery guarantee with dead letter queues
Integration: Native GCP integration (IAM auth, Cloud Monitoring)
Cost: Pay-per-use ($40/million messages vs. $500/month for RabbitMQ VMs)
Operations: Zero maintenance (no cluster management, no broker failures)

Message Flow:

Topics:

terraform-workspace-discovered - New workspace found via GitHub scan
terraform-state-updated - State version changed (webhook trigger)
sanitization-failed - Dead letter queue for manual review
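The ingestion side of this flow can be sketched as follows, assuming a hypothetical Publisher interface that stands in for the Pub/Sub client and a hypothetical WebhookEvent shape. The handler only enqueues the event; state download and sanitization happen later in a worker subscribed to the topic.

```typescript
// Hypothetical publisher abstraction; a real deployment would wrap the Pub/Sub client.
interface Publisher {
  publish(topic: string, data: string): Promise<string>; // resolves to a message ID
}

// Hypothetical event shape, for illustration only.
interface WebhookEvent {
  tenantId: string;
  workspaceId: string;
  stateVersion: number;
}

// Ingest path: publish to the topic and return 202 Accepted without processing state.
async function ingestWebhook(
  publisher: Publisher,
  event: WebhookEvent,
): Promise<{ status: number; messageId: string }> {
  const messageId = await publisher.publish('terraform-state-updated', JSON.stringify(event));
  return { status: 202, messageId };
}
```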

Consequences

Positive:

Decoupled architecture: Webhook handlers respond in < 100ms
Elastic capacity: Automatically handles burst traffic (1000 msg/sec)
Retry logic: Exponential backoff (10s → 600s) for transient failures
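To illustrate the retry bounds above, here is a small sketch of capped exponential backoff between 10 and 600 seconds. The doubling factor is an assumption for illustration; the managed service decides the actual spacing of redeliveries within the configured bounds.

```typescript
// Capped exponential backoff: the delay doubles per attempt and is clamped to maxSeconds.
// The doubling factor is an illustrative assumption, not the service's exact schedule.
function backoffSeconds(attempt: number, minSeconds = 10, maxSeconds = 600): number {
  const delay = minSeconds * Math.pow(2, Math.max(0, attempt));
  return Math.min(maxSeconds, delay);
}

// Attempts 0..6 yield 10, 20, 40, 80, 160, 320, 600 seconds.
const schedule = [0, 1, 2, 3, 4, 5, 6].map((n) => backoffSeconds(n));
```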

Negative:

At-least-once delivery: Workers must be idempotent (duplicate messages possible)
GCP vendor lock-in: Migrating to Kafka/RabbitMQ requires code changes
Cost at extreme scale: $40/million messages (10M messages = $400/month)

Mitigation:

Idempotent message handlers (skip messages whose state_version is already recorded for the workspace)
Multi-cloud abstraction layer (future: support AWS SQS, Azure Service Bus)
Cost monitoring: Alert if message volume exceeds 5M/day

Idempotent Handler Example:

async function handleStateUpdate(message: StateUpdateMessage) {
  const { workspace_id, state_version } = message;

  // Look up the state version already recorded for this workspace
  const existing = await db.query(
    'SELECT state_version FROM catalog_entities WHERE workspace_id = $1',
    [workspace_id]
  );

  // Assumes a node-postgres style result: row data lives in `rows`
  if (existing.rows[0]?.state_version === state_version) {
    console.log('Duplicate message, skipping');
    return; // Already processed
  }

  // Process state update...
}

ADR-003: Real-Time Sync via Terraform Cloud Webhooks

Status: ✅ Accepted Date: 2024-11-13 Impact: Medium (User Experience)

Context

Users expect infrastructure changes to appear in the Backstage catalog within 5 minutes of terraform apply. Polling every workspace at that interval (12 polls per workspace per hour) would require roughly 20,000 Terraform Cloud API calls/hour across all tenants, hitting rate limits.

Options Considered:

Webhooks (Event-Driven): Near-instant updates, low API usage
Polling (Every 5 minutes): Simple, but high API cost and latency
Hybrid (Webhooks + Hourly Poll): Best of both worlds

Decision

Use Terraform Cloud webhooks as primary sync mechanism, with hourly polling as fallback.

Rationale

Advantages:

Latency: Webhook delivers event in < 30 seconds (vs. 2.5 minutes average for polling)
Efficiency: Event-driven reduces API calls by 95% (20K → 1K calls/hour)
Cost: Lower Terraform Cloud API usage (rate limit budget preserved)
User Experience: "Live" catalog updates instead of stale data

Webhook Configuration:

# Per-tenant Terraform Cloud webhook
webhooks:
  - name: "backstage-catalog-sync"
    url: "https://plugin.backstage.example.com/api/webhooks/terraform-cloud"
    token: "<hmac-secret>"
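    # Trigger names are illustrative; confirm them against Terraform Cloud's
    # notification triggers (workspace-level events may not be offered).
    events: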
      - "run:completed"
      - "run:errored"
      - "workspace:created"
      - "workspace:deleted"

Security:

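HMAC signature verification (Terraform Cloud signs notification bodies with HMAC-SHA512 and sends the digest in the X-TFE-Notification-Signature header)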
Timestamp validation (reject if > 5 minutes old)
IP allowlist (Terraform Cloud egress IPs)
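The first two checks can be sketched with Node's built-in crypto module. This is a minimal illustration, not the plugin's handler: verifyWebhook is a hypothetical helper, the digest algorithm is a parameter so it can follow whatever the sender documents, and where the event timestamp comes from (payload field or header) is left to the caller as an assumption.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

const MAX_AGE_MS = 5 * 60 * 1000; // reject events older than 5 minutes

// Returns true only if the hex signature matches the raw body and the event is fresh.
function verifyWebhook(
  rawBody: string,
  signatureHex: string,
  secret: string,
  eventTimeMs: number,
  nowMs: number = Date.now(),
  algorithm: 'sha256' | 'sha512' = 'sha512',
): boolean {
  const expected = createHmac(algorithm, secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  if (!timingSafeEqual(received, expected)) return false;
  return nowMs - eventTimeMs <= MAX_AGE_MS;
}
```

The signature must be computed over the raw request bytes, before any JSON parsing, because re-serializing the body can change whitespace or key order and invalidate the digest.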

Consequences

Positive:

Near-real-time catalog updates (< 5 minutes → < 30 seconds)
95% reduction in Terraform Cloud API calls
Better user experience (perceived "live" updates)

Negative:

Webhook reliability dependency: If Terraform Cloud webhook delivery fails, the catalog goes stale until the next fallback sync
HMAC verification overhead: Every webhook request requires signature check
Replay attack surface: Must track nonces to prevent replay

Mitigation:

Fallback polling: Hourly full sync job catches missed webhooks
Webhook retry logic: Terraform Cloud retries up to 3 times with exponential backoff
Monitoring: Alert if no webhooks received for > 2 hours

Fallback Polling Job:

// Run every hour as safety net
setInterval(async () => {
  console.log('Starting hourly full sync (fallback)...');
  try {
    for (const tenant of tenants) {
      const workspaces = await terraformCloud.listWorkspaces(tenant.org);

      for (const workspace of workspaces) {
        const lastSyncTime = await getLastSyncTime(workspace.id); // epoch milliseconds
        const hoursSinceSync = (Date.now() - lastSyncTime) / (1000 * 60 * 60);

        if (hoursSinceSync > 1) {
          await enqueueStateUpdate(workspace.id); // Missed webhook
        }
      }
    }
  } catch (err) {
    // Without this, a rejected promise in the timer callback is unhandled
    console.error('Hourly full sync failed', err);
  }
}, 3600000); // 1 hour

ADR-004: In-Memory State Sanitization

Status: ✅ Accepted Date: 2024-11-13 Impact: High (Security-critical)

Context

Terraform state files contain sensitive data:

API keys, passwords, database credentials
Service account keys (GCP, AWS)
Private IP addresses and internal hostnames
Personally Identifiable Information (PII)

This data must never be persisted to disk or reach the Backstage catalog. Regulatory compliance (GDPR, CCPA, PCI-DSS) requires demonstrable sanitization.

Options Considered:

In-Memory Sanitization: Process state in RAM, never write unredacted to disk
Temporary File Sanitization: Write to disk, sanitize, then delete
Database-Side Sanitization: Store raw state, sanitize on read (❌ unacceptable)

Decision

Sanitize Terraform state in-memory only, never persisting unredacted state to disk or database.

Rationale

Advantages:

Security: No plaintext secrets on disk (reduces attack surface)
Compliance: GDPR/CCPA "right to be forgotten" - no persistent PII
Performance: In-memory processing 3-5x faster than disk I/O
Auditability: All redactions logged before persistence

Sanitization Process:

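// Worker receives state JSON from Terraform Cloud
const response = await fetch(stateDownloadUrl);
if (!response.ok) throw new Error(`State download failed: ${response.status}`);
const stateJson = await response.json();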

// Load tenant-specific sanitization rules
const sanitizer = new StateSanitizer(tenantId);

// Sanitize in-memory (never write unredacted state to disk)
const { sanitized, violations } = await sanitizer.sanitize(stateJson);

// Log all redactions for audit
await auditLog.record({
  workspace_id: workspace.id,
  violations: violations,  // [{ path: "resource.password", category: "credential" }]
  pii_count: violations.filter(v => v.category === 'pii').length,
  credential_count: violations.filter(v => v.category === 'credential').length
});

// Persist ONLY sanitized state
await db.query('INSERT INTO catalog_entities ...', [sanitized]);

Sanitization Rules:

const defaultRules: SanitizationRule[] = [
  {
    id: 'strip-passwords',
    pattern: /(password|passwd|pwd)\s*[:=]\s*["']?([^"'\s]+)/i,
    action: 'redact',
    category: 'credential'
  },
  {
    id: 'strip-api-keys',
    pattern: /[a-zA-Z0-9]{32,}/,  // High-entropy strings
    min_entropy: 4.5,
    action: 'redact',
    category: 'credential'
  },
  {
    id: 'strip-emails',
    pattern: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/,
    action: 'redact',
    category: 'pii'
  }
];
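A sketch of how rules of this shape might be applied to a single string value, including the entropy gate implied by min_entropy. The Rule and Violation interfaces and the redactString helper are hypothetical; Shannon entropy (bits per character) is one common way to score randomness, and the source does not say which measure the sanitizer uses.

```typescript
// Hypothetical rule shape, a subset of the fields used in defaultRules.
interface Rule {
  id: string;
  pattern: RegExp;
  min_entropy?: number;
  category: string;
}

interface Violation {
  rule: string;
  category: string;
}

// Shannon entropy in bits per character (0 for a string of one repeated character).
function shannonEntropy(s: string): number {
  if (s.length === 0) return 0;
  const counts: Record<string, number> = {};
  for (let i = 0; i < s.length; i++) {
    const ch = s.charAt(i);
    counts[ch] = (counts[ch] || 0) + 1;
  }
  let bits = 0;
  for (const ch of Object.keys(counts)) {
    const p = counts[ch] / s.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Replaces every match of every rule; matches below a rule's min_entropy are kept.
function redactString(value: string, rules: Rule[]): { sanitized: string; violations: Violation[] } {
  const violations: Violation[] = [];
  let sanitized = value;
  for (const rule of rules) {
    // Copy the pattern with the global flag so all occurrences are replaced, not just the first.
    const flags = rule.pattern.flags.indexOf('g') === -1 ? rule.pattern.flags + 'g' : rule.pattern.flags;
    const globalPattern = new RegExp(rule.pattern.source, flags);
    sanitized = sanitized.replace(globalPattern, (match: string) => {
      if (rule.min_entropy !== undefined && shannonEntropy(match) < rule.min_entropy) {
        return match; // low-entropy match, likely not a secret
      }
      violations.push({ rule: rule.id, category: rule.category });
      return '[REDACTED]';
    });
  }
  return { sanitized, violations };
}
```

A 32-character string can reach at most 5 bits per character (all characters distinct), so a 4.5 threshold only flags strings with little repetition.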

Consequences

Positive:

Stronger security posture (no unredacted state persisted)
Compliance-ready (GDPR Article 32: "pseudonymisation and encryption")
Faster processing (no disk I/O for temporary files)

Negative:

Higher memory requirements: Each worker needs 2-4GB RAM (state files 100-500MB)
Complex sanitization logic: Regex rules require tuning to avoid false positives
Loss of raw state: Cannot "undo" over-aggressive redactions

Mitigation:

Worker pod memory limits: 4GB per pod
Configurable allowlists: Tenants can mark specific patterns as "safe"
Audit trail: Log all redactions for manual review
Dead letter queue: States with excessive violations (> 100) go to manual review

Memory Management:

# Kubernetes resource limits
resources:
  requests:
    memory: 2Gi
    cpu: 1000m
  limits:
    memory: 4Gi
    cpu: 2000m

ADR-005: PostgreSQL Partitioning by Tenant

Status: ✅ Accepted Date: 2024-11-13 Impact: Medium (Performance Optimization)

Context

At 100 clients with 1000 entities each (100,000 total entities), full table scans become slow. Query latency exceeds 500ms for SELECT * FROM catalog_entities WHERE tenant_id = ? without partitioning.

Target Performance:

< 100ms p95 query latency at 100K entities
< 200ms p95 query latency at 1M entities

Decision

Implement hash partitioning by tenant_id with 10 partitions.

Rationale

Advantages:

Query Performance: Partition pruning reduces scanned rows by 90%
Index Efficiency: Smaller indexes per partition (10x faster lookups)
Maintenance: Partition-level VACUUM and ANALYZE
Scalability: Partition count can be raised as tenant count grows (existing hash partitions must be split and their rows redistributed)

Partitioning Strategy:

-- Create partitioned table
CREATE TABLE catalog_entities (
  entity_id UUID DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  entity_ref VARCHAR(500) NOT NULL,
  metadata JSONB NOT NULL,
  ...
) PARTITION BY HASH (tenant_id);

-- Create 10 partitions (adjust based on tenant count)
CREATE TABLE catalog_entities_p0 PARTITION OF catalog_entities
  FOR VALUES WITH (MODULUS 10, REMAINDER 0);
CREATE TABLE catalog_entities_p1 PARTITION OF catalog_entities
  FOR VALUES WITH (MODULUS 10, REMAINDER 1);
-- ... create p2 through p9

-- Index on each partition
CREATE INDEX idx_entities_p0_ref ON catalog_entities_p0(entity_ref);
-- ... create indexes on p1 through p9
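The elided partitions follow a fixed pattern, so the statements can be generated rather than written by hand. A minimal sketch, assuming a hypothetical hashPartitionDdl helper that emits the DDL as strings for a migration script; the table name must come from trusted configuration, since it is interpolated directly.

```typescript
// Generates one CREATE TABLE ... PARTITION OF statement per hash remainder.
// `table` is interpolated as-is, so it must be a trusted identifier.
function hashPartitionDdl(table: string, modulus: number): string[] {
  if (!Number.isInteger(modulus) || modulus < 1) {
    throw new RangeError('modulus must be a positive integer');
  }
  const statements: string[] = [];
  for (let remainder = 0; remainder < modulus; remainder++) {
    statements.push(
      `CREATE TABLE ${table}_p${remainder} PARTITION OF ${table}\n` +
        `  FOR VALUES WITH (MODULUS ${modulus}, REMAINDER ${remainder});`,
    );
  }
  return statements;
}

// Ten statements, catalog_entities_p0 through catalog_entities_p9.
const partitionDdl = hashPartitionDdl('catalog_entities', 10);
```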

Query Plan (with partitioning):

EXPLAIN ANALYZE SELECT * FROM catalog_entities WHERE tenant_id = '123e4567-e89b-12d3-a456-426614174000';

-- Result (abridged): Partition pruning activated, only 1 partition scanned
Seq Scan on catalog_entities_p3  (rows=1000, actual=85ms)
  Filter: (tenant_id = '123e4567-e89b-12d3-a456-426614174000')

Consequences

Positive:

10x query performance improvement (500ms → 50ms)
Better cache utilization (smaller working set)
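Partial isolation of "hot" tenants: a large tenant competes only with tenants hashed to the same partition (hash partitioning cannot pin a chosen tenant to its own partition; that would need LIST partitioning)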

Negative:

Schema complexity (10 partition tables instead of 1)
Migration overhead (partitioning existing data requires downtime)
Cross-partition queries slower (rare in multi-tenant app)

Mitigation:

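Partition count configurable (start with 10, increase to 50 at 500+ tenants)
Hash partitioning requires PostgreSQL 11+; raising the count means creating partitions with a larger modulus and moving rows, so rehearse the procedure to keep downtime short
Automated partition management: Script partition creation and re-balancing when tenant count grows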

ADR-006: Redis for Catalog Caching

Status: ✅ Accepted Date: 2024-11-13 Impact: Medium (Performance Optimization)

Context

Backstage UI makes frequent catalog queries (10-50 requests/page load). Database queries for the same entity (e.g., "list all components") are repetitive and slow (100-200ms). Target: < 50ms p95 API latency.

Decision

Implement Redis caching with cache-aside pattern and 5-minute TTL.

Rationale

Advantages:

Latency: Cache hits respond in 5-10ms (vs. 100-200ms database query)
Database Load: Offload 70-80% of reads to cache
Cost: Redis cheaper than additional database read replicas

Cache Strategy:

// Cache-aside pattern
async function getCatalogEntity(tenantId: string, entityRef: string): Promise<Entity | undefined> {
  const cacheKey = `entity:${tenantId}:${entityRef}`;

  // Try cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached); // Cache hit (5ms)
  }

  // Cache miss: query database (assumes a node-postgres style result with a rows array)
  const { rows } = await db.query(
    'SELECT * FROM catalog_entities WHERE tenant_id = $1 AND entity_ref = $2',
    [tenantId, entityRef]
  );
  const entity: Entity | undefined = rows[0];
  if (!entity) return undefined; // Not found: nothing to cache

  // Store in cache with TTL
  await redis.setex(cacheKey, 300, JSON.stringify(entity)); // 5 minutes

  return entity;
}

Cache Invalidation:

Write-through: Update cache on every database write
TTL-based: Expire after 5 minutes (eventual consistency acceptable)
Manual invalidation: Webhook events trigger cache delete
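The write-through and webhook-driven paths can be sketched against a small cache interface. The Cache interface, writeThrough, and invalidateEntity are hypothetical names; the key scheme matches the entity:tenantId:entityRef format used by the cache-aside read path.

```typescript
// Hypothetical cache shape; a Redis client would supply equivalent setex/del operations.
interface Cache {
  setex(key: string, ttlSeconds: number, value: string): Promise<unknown>;
  del(key: string): Promise<unknown>;
}

const ENTITY_TTL_SECONDS = 300; // 5 minutes

function entityKey(tenantId: string, entityRef: string): string {
  return `entity:${tenantId}:${entityRef}`;
}

// Write-through: persist first, then refresh the cached copy.
async function writeThrough(
  cache: Cache,
  save: (entity: object) => Promise<void>,
  tenantId: string,
  entityRef: string,
  entity: object,
): Promise<void> {
  await save(entity);
  await cache.setex(entityKey(tenantId, entityRef), ENTITY_TTL_SECONDS, JSON.stringify(entity));
}

// Webhook-driven invalidation: drop the key so the next read reloads from the database.
async function invalidateEntity(cache: Cache, tenantId: string, entityRef: string): Promise<void> {
  await cache.del(entityKey(tenantId, entityRef));
}
```

Saving before caching means a failed database write never leaves an unpersisted entity in the cache.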

Consequences

Positive:

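70% cache hit rate brings average API latency to roughly 50ms (0.7 × ~7.5ms + 0.3 × ~150ms); p95 still tracks cache-miss latency until the hit rate reaches 95%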
Database CPU usage reduced by 60%
Better user experience (faster page loads)
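A quick way to sanity-check latency figures like these is to compute the mean from the hit rate and the two latency classes. The sketch below uses midpoints of the ranges quoted in this ADR (7.5ms for a hit, 150ms for a miss) as assumed inputs.

```typescript
// Mean latency for a two-class (hit/miss) latency model.
function meanLatencyMs(hitRate: number, hitMs: number, missMs: number): number {
  return hitRate * hitMs + (1 - hitRate) * missMs;
}

// With every hit faster than every miss, the p95 request is a cache hit
// only when at least 95% of requests hit the cache.
function p95IsCacheHit(hitRate: number): boolean {
  return hitRate >= 0.95;
}

const estimatedMeanMs = meanLatencyMs(0.7, 7.5, 150); // about 50.25 ms
```

At a 70% hit rate the mean lands near 50ms, while the slowest 30% of requests still pay the full database latency.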

Negative:

Stale data: 5-minute TTL means catalog may be outdated
Memory cost: Redis cluster ($200/month for 10GB)
Complexity: Cache invalidation logic required

Mitigation:

Webhook-driven invalidation for critical entities
Monitoring: Alert if cache hit rate < 60%
Redis Cluster for high availability

ADR-007: Google Cloud Platform as Primary Provider

Status: ✅ Accepted Date: 2024-11-13 Impact: High (Infrastructure)

Context

Need to select cloud provider for hosting multi-tenant Backstage plugin. Requirements:

Managed Kubernetes (GKE, EKS, AKS)
Managed PostgreSQL with high availability
Managed message queue (Pub/Sub, SQS, Service Bus)
Secret management with automatic rotation

Decision

Deploy on Google Cloud Platform (GCP) with multi-cloud abstraction layer.

Rationale

Advantages:

GKE: Best-in-class Kubernetes (autopilot mode, auto-scaling)
Cloud SQL: Managed PostgreSQL with RLS support, CMEK encryption
Pub/Sub: Serverless message queue (vs. self-hosted RabbitMQ)
Integration: Native IAM integration (no service account keys)
Cost: 20% cheaper than AWS for equivalent workload

Cost Comparison (100 clients):

Service	GCP ($/month)	AWS ($/month)	Difference
Kubernetes	$2,500	$3,200	-$700
PostgreSQL HA	$800	$1,100	-$300
Message Queue	$400	$500	-$100
Total	$3,700	$4,800	-$1,100

Consequences

Positive:

Lower operational cost (20% savings)
Better Kubernetes experience (GKE autopilot)
Managed services reduce maintenance burden

Negative:

GCP vendor lock-in: Migrating to AWS requires re-architecture
Learning curve: Team must learn GCP-specific services

Mitigation:

Multi-cloud abstraction layer (support AWS in future)
Infrastructure-as-code (Terraform) for portability
Avoid GCP-specific features where a portable equivalent exists (e.g., use standard Kubernetes APIs)

ADR-008: Node.js 20 LTS for Backend Runtime

Status: ✅ Accepted Date: 2024-11-13 Impact: Low (Technology Choice)

Context

Backstage is built on Node.js/TypeScript. Need to select runtime version for backend plugin.

Decision

Use Node.js 20 LTS (Long-Term Support).

Rationale

Advantages:

Compatibility: Backstage officially supports Node.js 18-20
Performance: 10-20% faster than Node.js 16
Security: LTS receives security patches until April 2026
Modern Features: Native fetch API, test runner

Consequences

Positive:

Faster API response times (native performance improvements)
Better developer experience (modern JavaScript features)
Long-term support (no forced upgrade until Node.js 20 reaches end of life in April 2026)

Negative:

Must upgrade from Node.js 16 (some dependencies may break)

Mitigation:

Test all dependencies with Node.js 20 before deployment
Automated CI/CD tests on Node.js 20

ADR-009: Kubernetes (GKE) for Container Orchestration

Status: ✅ Accepted Date: 2024-11-13 Impact: High (Infrastructure)

Context

Need container orchestration for scaling backend pods (3-50 replicas), worker pods (5-100 replicas), and webhook handlers (2-20 replicas).

Decision

Deploy on Google Kubernetes Engine (GKE) with Autopilot mode.

Rationale

Advantages:

Auto-Scaling: Horizontal Pod Autoscaler (HPA) + Cluster Autoscaler
High Availability: Multi-zone pod distribution
Managed: Google manages control plane, node upgrades
Integration: Native GCP service integration (Cloud SQL, Pub/Sub)

Consequences

Positive:

Zero-downtime deployments (rolling updates)
Elastic capacity (auto-scale from 3 to 50 pods)
Simplified operations (no manual node management)

Negative:

Kubernetes complexity (steep learning curve)
Higher cost than VMs (but offset by auto-scaling savings)

Mitigation:

Use Backstage Kubernetes plugin for visibility
Infrastructure-as-code (Terraform) for reproducibility

ADR-010: Managed Prometheus for Observability

Status: ✅ Accepted Date: 2024-11-13 Impact: Low (Observability)

Context

Need metrics collection for monitoring API latency, error rates, queue depth, cache hit rates.

Decision

Use Google Cloud Managed Service for Prometheus.

Rationale

Advantages:

Managed: No self-hosted Prometheus cluster
Scalability: Handles millions of metrics
Integration: Native GKE integration (auto-discovery)

Consequences

Positive:

Zero operational overhead
High availability (99.9% SLA)

Negative:

GCP vendor lock-in (vs. self-hosted Prometheus)

Mitigation:

Use standard Prometheus metrics format (portable to other backends)

Decision Matrix

Decision	Security Impact	Cost Impact	Complexity	Reversibility
ADR-001: RLS	✅ High	✅ Low	Medium	Hard
ADR-002: Pub/Sub	Medium	✅ Low	Low	Medium
ADR-003: Webhooks	Medium	✅ Low	Medium	Easy
ADR-004: In-Memory Sanitization	✅ High	Medium	High	Hard
ADR-005: Partitioning	Low	✅ Low	Medium	Hard
ADR-006: Redis Cache	Low	Medium	Low	Easy
ADR-007: GCP	Low	✅ High	Medium	Hard
ADR-008: Node.js 20	Low	✅ Low	Low	Easy
ADR-009: GKE	Low	Medium	High	Hard
ADR-010: Managed Prometheus	Low	✅ Low	Low	Easy

Document Maintenance:

Review ADRs quarterly
Update status when decisions superseded
Add new ADRs for significant architectural changes

Last Updated: 2024-11-13 Next Review: 2025-02-13
