Enterprise Multi-Tenant Backstage Plugin Architecture

Terraform Cloud Integration for SaaS Platform

Document Version: 1.0
Date: November 13, 2024
Status: Architecture Design - Enterprise Edition
Classification: Internal Technical Design

Executive Summary

This document defines the enterprise-grade architecture for a multi-tenant Backstage plugin that integrates with Terraform Cloud to provide infrastructure catalog management across hundreds of client organizations. The design prioritizes security, scalability, tenant isolation, and real-time synchronization while maintaining a shared SaaS codebase.

Key Metrics & Targets

Metric	Target	Scale Factor
Clients Supported	100+ enterprises	10x current
Workspaces per Client	50-500	Variable
Total Entities	100,000+	10x current
API Rate Limit	30 req/sec (Terraform Cloud)	Shared resource
Sync Latency	< 5 minutes (near-real-time)	Event-driven
Data Isolation	100% (zero cross-tenant leaks)	Critical
Uptime SLA	99.9%	Enterprise-grade

1. System Context & Requirements

1.1 Business Context

Problem Statement:

Multiple enterprise clients need infrastructure visibility in Backstage
Each client has 10-100 business units with independent infrastructure repositories
Business units are dynamically created (M&A, reorganization, new initiatives)
Current manual catalog maintenance doesn't scale beyond 50 repositories
Security teams require that sensitive data never appear in Backstage

Solution Vision: A multi-tenant SaaS Backstage plugin that:

Automatically discovers infrastructure repositories across GitHub organizations
Pulls Terraform state from Terraform Cloud workspaces
Sanitizes sensitive data before catalog ingestion
Maintains strict tenant isolation in shared database
Updates catalog in near-real-time (< 5 minutes)
Scales to 100+ clients with 1000+ repositories each

1.2 Critical Requirements

Functional Requirements

Terraform Cloud Integration
- Authenticate via organization tokens, team tokens, or user tokens
- Discover all workspaces across multiple organizations
- Download latest state versions via API
- Subscribe to workspace run webhooks for real-time updates
- Handle pagination (100+ workspace pages)
Multi-Tenant Data Isolation
- Client data NEVER visible to other clients
- Separate authentication per client (API keys, JWT tokens)
- Audit logging of all cross-tenant access attempts
- Configurable tenant-specific sanitization rules
State Sanitization
- Detect and redact PII (emails, names, addresses)
- Strip credentials (passwords, API keys, tokens, certificates)
- Remove service account keys and private keys
- Configurable allowlists per client
- Audit trail of all redactions
Dynamic Discovery
- Auto-detect new business unit repositories (GitHub org scanning)
- Extract metadata from catalog-info.yaml
- Link repositories to Terraform Cloud workspaces
- Detect deleted/archived repositories
Scalability
- Process 1000+ workspace updates concurrently
- Queue-based architecture for burst handling
- Database partitioning for 100+ clients
- CDN caching for static catalog data

Non-Functional Requirements

Security
- SOC 2 Type II compliance
- Encryption at rest (AES-256) and in transit (TLS 1.3)
- Zero trust network architecture
- Role-based access control (RBAC) per tenant
Performance
- < 200ms API response time (p95)
- < 5 minute sync latency for state updates
- Support 10,000 concurrent users
- Database queries < 100ms (p95)
Reliability
- 99.9% uptime SLA
- Automatic failover for database and queue
- Graceful degradation under load
- Circuit breakers for external API calls
Observability
- Structured logging with trace IDs
- Metrics dashboards per tenant
- Alerting for anomalies
- Cost tracking per tenant

2. Terraform Cloud Integration Architecture

2.1 API Authentication Strategy

Authentication Hierarchy

Token Type	Scope	Use Case	Rotation
Organization Token	Org-wide read/write	Initial setup, workspace creation	Manual (yearly)
Team Token	Team-scoped read	Read workspace states for specific teams	Automatic (90 days)
User Token	User-scoped read	Development/testing only	Automatic (30 days)

Implementation Details:

// Token hierarchy with automatic fallback
interface TerraformCloudAuth {
  tenantId: string;
  primaryToken: string;      // Organization token
  fallbackTokens: string[];  // Team tokens for redundancy
  rotationPolicy: {
    enabled: boolean;
    intervalDays: number;
    alertBeforeDays: number;
  };
}

// Token storage in Google Secret Manager
const secretPath = `projects/${PROJECT_ID}/secrets/tfc-token-${tenantId}/versions/latest`;

2.2 Workspace Discovery API

Flow:

Organization Listing: GET /api/v2/organizations
Workspace Pagination: GET /api/v2/organizations/{org}/workspaces?page[size]=100&page[number]=1
State Version Retrieval: GET /api/v2/workspaces/{workspace_id}/current-state-version
State Download: GET {state_version.attributes["hosted-state-download-url"]}

Rate Limiting Strategy:

Terraform Cloud: 30 requests/second (shared across all users)
Plugin: Implements token bucket algorithm
Per-tenant quota: 5 requests/second
Burst allowance: 50 requests

// Rate limiter implementation
class TerraformCloudRateLimiter {
  private buckets: Map<string, TokenBucket> = new Map();

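  async checkLimit(tenantId: string): Promise<boolean> {
    // Create the bucket on first use, then consume from it.
    // (Previously a newly created bucket was never assigned, so the first call threw.)
    let bucket = this.buckets.get(tenantId);
    if (!bucket) {
      bucket = new TokenBucket(5, 5); // 5 req/sec
      this.buckets.set(tenantId, bucket);
    }
    return bucket.tryConsume(1);
  }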

  async waitForCapacity(tenantId: string): Promise<void> {
    while (!await this.checkLimit(tenantId)) {
      await sleep(200); // Wait 200ms before retry
    }
  }
}
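
The rate limiter above relies on a `TokenBucket` type that this document does not define, and a per-tenant bucket alone does not protect the shared 30 req/sec budget (100 tenants at 5 req/sec would exceed it). The following is a minimal sketch of both pieces, not the plugin's actual implementation. The `(capacity, refillPerSecond)` constructor signature, the injectable clock, and the `TwoLevelLimiter` name are assumptions made for illustration; the numeric limits come from the strategy listed above.

```typescript
// Hypothetical TokenBucket: the (capacity, refillPerSecond) signature is an assumption.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private readonly capacity: number,
    private readonly refillPerSecond: number,
    private readonly now: () => number = Date.now, // injectable clock for tests
  ) {
    this.tokens = capacity;
    this.lastRefillMs = now();
  }

  // Add the tokens earned since the last call, capped at capacity.
  private refill(): void {
    const nowMs = this.now();
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
  }

  // Returns true and deducts the tokens if enough are available.
  tryConsume(count: number): boolean {
    this.refill();
    if (this.tokens < count) return false;
    this.tokens -= count;
    return true;
  }
}

// Hypothetical two-level limiter: a request must pass the tenant bucket
// (5 req/sec) and the shared global bucket (30 req/sec, burst of 50).
class TwoLevelLimiter {
  private readonly tenantBuckets = new Map<string, TokenBucket>();
  private readonly globalBucket: TokenBucket;

  constructor(private readonly now: () => number = Date.now) {
    this.globalBucket = new TokenBucket(50, 30, now);
  }

  tryAcquire(tenantId: string): boolean {
    let bucket = this.tenantBuckets.get(tenantId);
    if (!bucket) {
      bucket = new TokenBucket(5, 5, this.now);
      this.tenantBuckets.set(tenantId, bucket);
    }
    // The tenant bucket is checked first so a throttled tenant cannot drain
    // the global budget. A tenant token is still spent if the global bucket is empty.
    return bucket.tryConsume(1) && this.globalBucket.tryConsume(1);
  }
}
```

Checking the tenant bucket before the global one is a deliberate choice in this sketch: it keeps one noisy tenant from starving the others, at the cost of occasionally spending a tenant token on a request the global bucket then rejects.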

2.3 Webhook Event Processing

Real-Time Updates via Terraform Cloud Webhooks:

# Webhook Configuration
webhooks:
  - name: "backstage-catalog-sync"
    enabled: true
    url: "https://plugin.backstage.example.com/api/webhooks/terraform-cloud"
    token: "<webhook-secret-token>"
    events:
      - "run:completed"
      - "run:errored"
      - "workspace:created"
      - "workspace:deleted"

Event Processing Flow:

Webhook Security:

HMAC signature verification (Terraform Cloud signs generic notifications with HMAC SHA-512 in the X-TFE-Notification-Signature header)
IP allowlist for Terraform Cloud endpoints
Replay attack prevention (timestamp + nonce)
Rate limiting per tenant (1000 events/hour)
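
The signature and replay checks listed above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the plugin's handler: `verifySignature` and `ReplayGuard` are hypothetical names, the digest algorithm is a parameter (defaulting to SHA-512), and the document does not say where the timestamp and nonce come from, so they are passed in by the caller.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical helper: verifies a hex-encoded HMAC over the raw request body.
function verifySignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
  algorithm: string = 'sha512',
): boolean {
  const expected = createHmac(algorithm, secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Hypothetical replay guard: rejects stale timestamps and repeated nonces.
class ReplayGuard {
  private readonly seen = new Map<string, number>(); // nonce -> expiry (ms)

  constructor(private readonly maxAgeMs: number = 5 * 60 * 1000) {}

  accept(nonce: string, timestampMs: number, nowMs: number = Date.now()): boolean {
    // Drop nonces whose acceptance window has passed.
    this.seen.forEach((expiry, key) => {
      if (expiry <= nowMs) this.seen.delete(key);
    });
    if (Math.abs(nowMs - timestampMs) > this.maxAgeMs) return false; // too old or too far ahead
    if (this.seen.has(nonce)) return false; // already processed
    // Remember the nonce for as long as its timestamp would still be accepted.
    this.seen.set(nonce, timestampMs + this.maxAgeMs);
    return true;
  }
}
```

The signature must be computed over the raw request bytes, before any JSON parsing, since re-serializing the body can change whitespace or key order and invalidate the digest. An in-memory nonce map only works for a single replica; a shared store would be needed once the webhook handler scales out.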

3. Multi-Tenant Data Architecture

3.1 Tenant Isolation Strategy

Option Analysis:

Strategy	Pros	Cons	Recommendation
Separate Databases	Maximum isolation, simple RBAC	High cost, complex backups	❌ Not scalable
Separate Schemas	Good isolation, moderate cost	Schema migrations complex	⚠️ Fallback option
Row-Level Security (RLS)	Cost-effective, single DB	Complex policies, performance overhead	✅ Primary choice
Discriminator Column	Simple to implement	No DB-level isolation, risky	❌ Insufficient security

Selected Approach: PostgreSQL Row-Level Security (RLS) + Tenant Column

-- Tenant isolation table
CREATE TABLE tenants (
  tenant_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name VARCHAR(255) NOT NULL,
  org_slug VARCHAR(100) UNIQUE NOT NULL,
  plan_tier VARCHAR(50) DEFAULT 'enterprise',
  status VARCHAR(20) NOT NULL DEFAULT 'active', -- Checked by the tenant context middleware
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

-- Catalog entities with tenant column
CREATE TABLE catalog_entities (
  entity_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
  entity_ref VARCHAR(500) NOT NULL,
  entity_kind VARCHAR(100) NOT NULL,
  entity_name VARCHAR(255) NOT NULL,
  metadata JSONB NOT NULL,
  spec JSONB NOT NULL,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW(),

  CONSTRAINT unique_entity_per_tenant UNIQUE (tenant_id, entity_ref)
);

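-- Enable Row-Level Security
-- FORCE applies the policies to the table owner as well, which otherwise bypasses RLS
ALTER TABLE catalog_entities ENABLE ROW LEVEL SECURITY;
ALTER TABLE catalog_entities FORCE ROW LEVEL SECURITY;

-- Policy: Users can only see their tenant's data
CREATE POLICY tenant_isolation_policy ON catalog_entities
  USING (tenant_id = current_setting('app.current_tenant')::UUID);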

-- Index for performance
CREATE INDEX idx_entities_tenant_id ON catalog_entities(tenant_id);
CREATE INDEX idx_entities_kind_name ON catalog_entities(tenant_id, entity_kind, entity_name);

Tenant Context Injection:

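// Middleware to set tenant context
// Assumes `db` is a node-postgres Pool. Each request gets a dedicated client so the
// transaction-scoped tenant setting cannot leak to other requests sharing the pool.
app.use(async (req, res, next) => {
  const tenantId = await extractTenantId(req); // From JWT, API key, or header

  if (!tenantId) {
    return res.status(401).json({ error: 'Missing tenant context' });
  }

  // Validate tenant exists and is active
  const tenant = await db.query('SELECT 1 FROM tenants WHERE tenant_id = $1 AND status = $2',
    [tenantId, 'active']);

  // db.query resolves to a result object, so check the row count rather than truthiness
  if (tenant.rows.length === 0) {
    return res.status(403).json({ error: 'Invalid or inactive tenant' });
  }

  // SET does not accept bind parameters, so use set_config(); the third argument
  // (is_local = true) scopes the value to the current transaction.
  const client = await db.connect();
  await client.query('BEGIN');
  await client.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);

  // Commit and return the connection to the pool once the response is done
  res.on('close', async () => {
    try { await client.query('COMMIT'); } finally { client.release(); }
  });

  req.tenantId = tenantId;
  req.db = client; // Handlers must run tenant-scoped queries through this client
  next();
});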

3.2 Tenant Identification Methods

Priority Order:

JWT Token (Production): Tenant ID embedded in signed JWT claims
API Key (Service-to-Service): Hashed API key mapped to tenant
Header Override (Development): X-Tenant-ID header (disabled in prod)

// JWT Token Structure
interface BackstageTenantJWT {
  sub: string;           // User ID
  tenant_id: string;     // Primary tenant identifier
  tenant_slug: string;   // Human-readable tenant name
  permissions: string[]; // Backstage permissions
  iat: number;           // Issued at
  exp: number;           // Expiration (1 hour)
}

// API Key Structure (hashed in database)
interface APIKey {
  key_id: string;
  tenant_id: string;
  key_hash: string;      // bcrypt hash of API key
  scopes: string[];      // e.g., ['read:catalog', 'write:webhooks']
  expires_at: Date;
  last_used_at: Date;
}
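
The priority order above can be sketched as a small resolver. This is an illustrative sketch, assuming that JWT signature verification and API key hash comparison have already happened upstream; `TenantCredentials`, `TenantSource`, and `resolveTenant` are hypothetical names, not part of the plugin.

```typescript
// Hypothetical input: credentials that have ALREADY been verified upstream.
interface TenantCredentials {
  jwtClaims?: { tenant_id?: string }; // claims from a verified JWT
  apiKeyTenantId?: string;            // tenant mapped from a verified API key
  headerTenantId?: string;            // raw X-Tenant-ID header value
}

type TenantSource = 'jwt' | 'api-key' | 'header';

interface ResolvedTenant {
  tenantId: string;
  source: TenantSource;
}

// Applies the priority order: JWT, then API key, then the development-only header.
function resolveTenant(creds: TenantCredentials, isProduction: boolean): ResolvedTenant | null {
  if (creds.jwtClaims?.tenant_id) {
    return { tenantId: creds.jwtClaims.tenant_id, source: 'jwt' };
  }
  if (creds.apiKeyTenantId) {
    return { tenantId: creds.apiKeyTenantId, source: 'api-key' };
  }
  // The header override is ignored entirely in production.
  if (!isProduction && creds.headerTenantId) {
    return { tenantId: creds.headerTenantId, source: 'header' };
  }
  return null;
}
```

Returning the source alongside the tenant ID lets audit logs record how each request was attributed, which helps when investigating cross-tenant access attempts.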

3.3 Entity Naming Conventions

Cross-Tenant Uniqueness:

# Standard entity ref format
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  # Client-prefixed names prevent collisions
  name: acme-corp-payment-service
  namespace: acme-corp        # Tenant slug as namespace
  annotations:
    backstage.io/techdocs-ref: dir:.
    backstage.io/source-location: url:https://github.com/acme-corp/payment-service
    terraform.io/workspace-id: ws-abc123def456
    terraform.io/organization: acme-corp-prod
spec:
  type: service
  lifecycle: production
  owner: acme-corp-platform-team
  system: acme-corp-payments

Namespace Hierarchy:

{tenant-slug}                     # Root namespace (e.g., acme-corp)
├── {tenant-slug}-infrastructure  # Infrastructure components
├── {tenant-slug}-platform        # Platform services
└── {tenant-slug}-{business-unit} # Business unit namespaces
    ├── {bu}-development
    ├── {bu}-staging
    └── {bu}-production
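
A sketch of how namespaces and entity refs in this hierarchy might be derived and validated. The helper names are hypothetical, and the validation rule (lowercase alphanumerics and hyphens, at most 63 characters) is an assumption chosen to be conservative; the catalog's configured validators remain the source of truth.

```typescript
// Assumed rule: DNS-label style, 1 to 63 characters, lowercase alphanumerics and hyphens.
const DNS_LABEL = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/;

// Lowercases and collapses any run of other characters into a single hyphen.
function slugify(raw: string): string {
  return raw.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
}

// Builds "{tenant-slug}" or "{tenant-slug}-{business-unit}".
function buildNamespace(tenantSlug: string, businessUnit?: string): string {
  const parts = [slugify(tenantSlug)];
  if (businessUnit) parts.push(slugify(businessUnit));
  const namespace = parts.join('-');
  if (!DNS_LABEL.test(namespace)) {
    throw new Error(`Invalid namespace: "${namespace}"`);
  }
  return namespace;
}

// Formats an entity ref as kind:namespace/name.
function buildEntityRef(kind: string, namespace: string, name: string): string {
  return `${kind.toLowerCase()}:${namespace}/${name}`;
}
```

For example, `buildNamespace('Acme Corp', 'Payments')` yields `acme-corp-payments`, and `buildEntityRef('Component', 'acme-corp', 'acme-corp-payment-service')` yields `component:acme-corp/acme-corp-payment-service`. Long tenant and business unit names can exceed the length limit once combined, which is why the check runs on the joined value.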

4. State Sanitization Pipeline Architecture

4.1 Sanitization Workflow

4.2 Sensitive Data Detection Rules

Rule Categories:

Category	Detection Method	Example Patterns	Action
PII	Regex + ML	Email, SSN, phone numbers	Redact
Credentials	Keyword + entropy	API keys, passwords, tokens	Redact
Private Keys	Format detection	RSA/EC keys, certificates	Redact
Cloud Secrets	Provider-specific	GCP service account keys, AWS access keys	Redact
Database Credentials	Connection strings	`postgres://user:pass@host`	Redact
IP Addresses	Regex (private ranges)	`10.x.x.x`, `192.168.x.x` (configurable)	Redact or Allow
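
A minimal sketch of the regex-based part of these rules, covering one example pattern per category. The patterns and the `classify` helper are illustrative assumptions, not the production rule set, and the ML and entropy-based detection named in the table is not covered here.

```typescript
type Category = 'pii' | 'credential' | 'private-key' | 'cloud-secret' | 'db-credential' | 'ip-address';

interface Detector {
  category: Category;
  pattern: RegExp;
}

// Order matters: more specific patterns run first, so a connection string
// is not misreported as an email address.
const DETECTORS: Detector[] = [
  { category: 'private-key', pattern: /-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----/ },
  { category: 'cloud-secret', pattern: /\bAKIA[0-9A-Z]{16}\b/ }, // AWS access key ID format
  { category: 'db-credential', pattern: /\b(?:postgres|postgresql|mysql):\/\/[^\s:@/]+:[^\s@/]+@/ },
  { category: 'pii', pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/ }, // email
  { category: 'ip-address', pattern: /\b(?:10\.\d{1,3}|192\.168)\.\d{1,3}\.\d{1,3}\b/ },
];

// Returns the first matching category, or null when nothing matches.
function classify(value: string): Category | null {
  for (const detector of DETECTORS) {
    if (detector.pattern.test(value)) return detector.category;
  }
  return null;
}
```

The patterns deliberately omit the global flag, because a global `RegExp` keeps state between `test()` calls and would give inconsistent results when reused across values.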

Detection Engine:

interface SanitizationRule {
  id: string;
  name: string;
  category: 'pii' | 'credential' | 'infrastructure';
  enabled: boolean;
  pattern: RegExp | string;
  action: 'redact' | 'hash' | 'allow';
  priority: number; // Higher priority rules run first
}

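class StateSanitizer {
  private tenantId: string;
  private auditLog: AuditLog; // Audit sink; the AuditLog type is assumed to be defined elsewhere
  private rules: SanitizationRule[];
  private allowlists: Map<string, Set<string>>; // Per-tenant allowlists

  constructor(tenantId: string, auditLog: AuditLog) {
    this.tenantId = tenantId;
    this.auditLog = auditLog;
    // Sort so that higher-priority rules run first
    this.rules = this.loadRules(tenantId).sort((a, b) => b.priority - a.priority);
    this.allowlists = this.loadAllowlists(tenantId);
  }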

  async sanitize(state: TerraformState): Promise<SanitizedState> {
    const violations: Violation[] = [];
    const redactedState = cloneDeep(state);

    // Traverse JSON recursively
    this.traverseAndSanitize(redactedState.resources, violations);

    // Log all redactions for audit
    await this.auditLog.record({
      tenantId: this.tenantId,
      stateId: state.id,
      violations: violations,
      timestamp: new Date()
    });

    return {
      sanitized: redactedState,
      violations: violations,
      safe: violations.length === 0
    };
  }

  private traverseAndSanitize(obj: any, violations: Violation[], path: string = ''): void {
    for (const [key, value] of Object.entries(obj)) {
      // Avoid a leading dot on top-level paths
      const currentPath = path ? `${path}.${key}` : key;

      // Check if key itself is sensitive (e.g., "password", "api_key")
      if (this.isSensitiveKey(key)) {
        violations.push({ path: currentPath, rule: 'sensitive-key', value: '***' });
        obj[key] = '[REDACTED]';
        continue;
      }

      // Check if value matches sensitive patterns
      if (typeof value === 'string') {
        const match = this.matchesRule(value);
        if (match && !this.isAllowlisted(currentPath, value)) {
          violations.push({ path: currentPath, rule: match.id, value: '***' });
          obj[key] = `[REDACTED:${match.category}]`;
        }
      } else if (value !== null && typeof value === 'object') { // typeof null is 'object'
        this.traverseAndSanitize(value, violations, currentPath);
      }
    }
  }
}

4.3 Configurable Sanitization Rules

Per-Tenant Configuration:

# Example: acme-corp sanitization config
tenant_id: acme-corp
sanitization:
  global_rules:
    - id: strip-passwords
      enabled: true
      pattern: "(password|passwd|pwd)\\s*[:=]\\s*['\"]?([^'\"\\s]+)"
      action: redact

    - id: strip-api-keys
      enabled: true
      pattern: "[a-zA-Z0-9]{32,}"  # High-entropy strings
      min_entropy: 4.5
      action: redact

  allowlists:
    # Allow specific IP ranges
    - pattern: "10\\.128\\..*"
      reason: "Internal VPC CIDR"

    # Allow specific service account emails
    - pattern: "terraform@acme-corp\\.iam\\.gserviceaccount\\.com"
      reason: "Public service account for Terraform Cloud"

  custom_rules:
    - id: redact-internal-domains
      pattern: ".*\\.internal\\.acme\\.com"
      action: redact
      reason: "Internal domain names are confidential"
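
The `min_entropy` setting above implies a Shannon entropy calculation over the characters of a candidate string. A minimal sketch follows; `shannonEntropy` and `looksLikeApiKey` are hypothetical helper names, and anchoring the pattern to the whole string is an assumption (the configured pattern is unanchored).

```typescript
// Shannon entropy in bits per character: H = -sum(p_i * log2(p_i)).
function shannonEntropy(value: string): number {
  const chars = Array.from(value); // iterate by code point
  if (chars.length === 0) return 0;

  const counts = new Map<string, number>();
  for (const ch of chars) {
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }

  let entropy = 0;
  counts.forEach(count => {
    const p = count / chars.length;
    entropy -= p * Math.log2(p);
  });
  return entropy;
}

// Mirrors the strip-api-keys rule: 32+ alphanumerics AND entropy at or above the threshold.
function looksLikeApiKey(value: string, minEntropy: number = 4.5): boolean {
  return /^[a-zA-Z0-9]{32,}$/.test(value) && shannonEntropy(value) >= minEntropy;
}
```

One consequence worth noting: a string drawn from a 16-symbol alphabet, such as a hexadecimal commit SHA, has at most 4 bits of entropy per character, so it never trips a 4.5 threshold. That reduces false positives on hashes but also means hex-encoded secrets need a separate rule.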

4.4 Audit Logging

Audit Log Schema:

CREATE TABLE sanitization_audit_log (
  log_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
  workspace_id VARCHAR(255) NOT NULL,
  state_version VARCHAR(100) NOT NULL,
  violations JSONB NOT NULL, -- Array of violation objects
  sanitized_at TIMESTAMP DEFAULT NOW(),

  -- For compliance reporting
  pii_count INTEGER DEFAULT 0,
  credential_count INTEGER DEFAULT 0,
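  total_redactions INTEGER DEFAULT 0
);

-- PostgreSQL does not support inline INDEX clauses in CREATE TABLE,
-- so the indexes are created as separate statements
CREATE INDEX idx_audit_tenant_time ON sanitization_audit_log (tenant_id, sanitized_at);
CREATE INDEX idx_audit_workspace ON sanitization_audit_log (workspace_id);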

-- Example query: Violations by tenant over last 30 days
SELECT
  tenant_id,
  COUNT(*) as total_sanitizations,
  SUM(pii_count) as total_pii_redactions,
  SUM(credential_count) as total_credential_redactions
FROM sanitization_audit_log
WHERE sanitized_at > NOW() - INTERVAL '30 days'
GROUP BY tenant_id;

5. Scalability Architecture

5.1 System Components

5.2 Horizontal Scaling Strategy

Auto-Scaling Policies:

Component	Metric	Scale Up Threshold	Scale Down Threshold	Min/Max Replicas
Backend API	CPU Utilization	> 70%	< 30%	3 / 50
Catalog Processor	Queue Depth	> 1000 messages	< 100 messages	5 / 100
Webhook Handler	Request Rate	> 500 req/sec	< 100 req/sec	2 / 20

Kubernetes HorizontalPodAutoscaler:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalog-processor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalog-processor
  minReplicas: 5
  maxReplicas: 100
  metrics:
    - type: External
      external:
        metric:
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              resource.labels.subscription_id: state-sync-subscription
        target:
          type: AverageValue
          averageValue: "1000"

    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
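
To reason about how this autoscaler behaves, the replica calculation can be sketched as below. It follows the documented HorizontalPodAutoscaler rule (desired = ceil(current x currentMetric / target), taking the highest result across metrics) but is a simplification: it ignores the tolerance band, stabilization windows, and pod readiness. The function and field names are hypothetical.

```typescript
interface ScalingInput {
  currentReplicas: number;
  queueDepth: number;           // total undelivered messages
  targetAvgQueueDepth: number;  // averageValue target per replica
  cpuUtilization: number;       // current average CPU, percent
  targetCpuUtilization: number; // target average CPU, percent
  minReplicas: number;
  maxReplicas: number;
}

function desiredReplicas(input: ScalingInput): number {
  // AverageValue targets divide the metric total by the per-replica target.
  const byQueue = Math.ceil(input.queueDepth / input.targetAvgQueueDepth);
  // Utilization targets scale the current replica count by the usage ratio.
  const byCpu = Math.ceil(
    (input.currentReplicas * input.cpuUtilization) / input.targetCpuUtilization,
  );
  // The largest recommendation wins, bounded by min/max replicas.
  const desired = Math.max(byQueue, byCpu);
  return Math.min(input.maxReplicas, Math.max(input.minReplicas, desired));
}
```

Because the queue target is an `AverageValue`, the threshold of 1000 applies per replica. With a minimum of 5 replicas, queue-driven scale-up starts once the backlog exceeds roughly 5,000 messages in total, not 1,000.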

5.3 Queue Architecture

Message Queue Design (Google Cloud Pub/Sub):

topics:
  - name: terraform-workspace-discovered
    description: "New workspace discovered via GitHub scan or TFC webhook"
    message_retention: 7 days
    subscriptions:
      - name: catalog-processor-subscription
        ack_deadline: 600s  # 10 minutes for state processing
        retry_policy:
          minimum_backoff: 10s
          maximum_backoff: 600s

  - name: terraform-state-updated
    description: "State version updated in Terraform Cloud"
    message_retention: 7 days
    subscriptions:
      - name: state-sync-subscription
        ack_deadline: 300s
        retry_policy:
          minimum_backoff: 5s
          maximum_backoff: 300s

  - name: sanitization-failed
    description: "Dead letter queue for sanitization failures"
    message_retention: 30 days
    subscriptions:
      - name: manual-review-subscription
        ack_deadline: 600s  # Pub/Sub maximum; extend the lease (modifyAckDeadline) during longer manual review

Message Schema:

interface WorkspaceDiscoveredMessage {
  tenant_id: string;
  workspace_id: string;
  workspace_name: string;
  organization: string;
  repository_url: string;
  discovered_at: string;
  discovery_method: 'github-scan' | 'terraform-webhook' | 'manual';
}

interface StateUpdatedMessage {
  tenant_id: string;
  workspace_id: string;
  state_version: string;
  run_id: string;
  updated_at: string;
  priority: 'high' | 'normal' | 'low';
}
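
Since queue payloads arrive as untyped JSON, consumers would typically validate them before processing. The following is a sketch of a runtime check for the state update message, assuming `updated_at` is an ISO 8601 timestamp; `parseStateUpdatedMessage` is a hypothetical helper name.

```typescript
interface StateUpdatedMessage {
  tenant_id: string;
  workspace_id: string;
  state_version: string;
  run_id: string;
  updated_at: string;
  priority: 'high' | 'normal' | 'low';
}

const REQUIRED_STRING_FIELDS = ['tenant_id', 'workspace_id', 'state_version', 'run_id', 'updated_at'];
const PRIORITIES = ['high', 'normal', 'low'];

// Parses and validates a raw queue payload; throws on malformed input so the
// message can be retried or routed to the dead letter topic.
function parseStateUpdatedMessage(raw: string): StateUpdatedMessage {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    throw new Error('Message must be a JSON object');
  }
  const record = data as Record<string, unknown>;

  for (const field of REQUIRED_STRING_FIELDS) {
    const value = record[field];
    if (typeof value !== 'string' || value.length === 0) {
      throw new Error(`Missing or invalid field: ${field}`);
    }
  }
  // Assumption: updated_at is an ISO 8601 timestamp.
  if (Number.isNaN(Date.parse(record['updated_at'] as string))) {
    throw new Error('updated_at is not a valid timestamp');
  }
  if (typeof record['priority'] !== 'string' || !PRIORITIES.includes(record['priority'])) {
    throw new Error('priority must be one of high, normal, low');
  }
  return record as unknown as StateUpdatedMessage;
}
```

Validating `tenant_id` at the queue boundary matters for isolation: a message without a tenant should fail fast, never fall through to a default tenant context.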

5.4 Database Optimization

Partitioning Strategy:

-- Partition catalog_entities by tenant_id (hash partitioning)
CREATE TABLE catalog_entities (
  entity_id UUID DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  entity_ref VARCHAR(500) NOT NULL,
  metadata JSONB NOT NULL,
  created_at TIMESTAMP DEFAULT NOW(),
  ...
) PARTITION BY HASH (tenant_id);

-- Create 10 partitions (adjust based on tenant count)
CREATE TABLE catalog_entities_p0 PARTITION OF catalog_entities
  FOR VALUES WITH (MODULUS 10, REMAINDER 0);
CREATE TABLE catalog_entities_p1 PARTITION OF catalog_entities
  FOR VALUES WITH (MODULUS 10, REMAINDER 1);
-- ... create p2 through p9

-- Index on each partition
CREATE INDEX idx_entities_p0_ref ON catalog_entities_p0(entity_ref);
CREATE INDEX idx_entities_p1_ref ON catalog_entities_p1(entity_ref);
-- ... create indexes on p2 through p9

Caching Strategy (Redis):

// Cache layers
const CACHE_TTL = {
  CATALOG_ENTITY: 300,      // 5 minutes
  WORKSPACE_METADATA: 600,  // 10 minutes
  TENANT_CONFIG: 3600,      // 1 hour
  STATE_CHECKSUM: 86400,    // 24 hours (for change detection)
};

// Cache key patterns
const cacheKeys = {
  entity: (tenantId: string, entityRef: string) =>
    `entity:${tenantId}:${entityRef}`,

  workspaceList: (tenantId: string) =>
    `workspaces:${tenantId}:list`,

  tenantConfig: (tenantId: string) =>
    `tenant:${tenantId}:config`,
};

// Cache-aside pattern
async function getCatalogEntity(tenantId: string, entityRef: string): Promise<Entity | null> {
  const cacheKey = cacheKeys.entity(tenantId, entityRef);

  // Try cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }

  // Cache miss: fetch from database (db.query resolves to a result object with rows)
  const { rows } = await db.query(
    'SELECT * FROM catalog_entities WHERE tenant_id = $1 AND entity_ref = $2',
    [tenantId, entityRef]
  );
  const entity = rows[0] ?? null;

  // Store in cache only when the entity exists, so a miss is not cached as a hit
  if (entity) {
    await redis.setex(cacheKey, CACHE_TTL.CATALOG_ENTITY, JSON.stringify(entity));
  }

  return entity;
}
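
The `STATE_CHECKSUM` entry above is described as supporting change detection. One way this could work is sketched below, assuming the checksum is a SHA-256 digest over a canonical (key-sorted) serialization of the parsed state JSON; the helper names are hypothetical and the document does not specify the hashing scheme.

```typescript
import { createHash } from 'node:crypto';

// Serializes parsed JSON with object keys sorted, so logically equal states
// produce identical strings regardless of key order. Input is assumed to be
// the result of JSON.parse (no undefined, functions, or cycles).
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map(key => `${JSON.stringify(key)}:${canonicalize(obj[key])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

function stateChecksum(state: unknown): string {
  return createHash('sha256').update(canonicalize(state)).digest('hex');
}

// True when there is no previous checksum or the state differs from it.
function hasStateChanged(previousChecksum: string | null, state: unknown): boolean {
  return previousChecksum !== stateChecksum(state);
}
```

Comparing checksums before sanitization and entity generation lets the processor skip webhook events that did not change the state, which saves both Terraform Cloud API quota and catalog writes.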

5.5 Performance Projections

Capacity Planning (100 Clients):

Resource	Per Client	100 Clients	Notes
Workspaces	200	20,000	Average across clients
Entities	1,000	100,000	Components, systems, APIs
State Syncs/Day	500	50,000	~0.6 syncs/second average
API Requests/Day	10,000	1,000,000	~12 req/second average
Database Size	50 MB	5 GB	JSONB compressed
Queue Messages/Day	1,000	100,000	Burst capacity: 1000 msg/sec
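
The per-second averages in the Notes column follow from dividing the daily totals by 86,400 seconds. A short sketch of that arithmetic, using only the per-client figures from the table (the variable names are illustrative):

```typescript
const SECONDS_PER_DAY = 86400;
const CLIENT_COUNT = 100;

// Per-client figures taken from the capacity table.
const perClient = {
  stateSyncsPerDay: 500,
  apiRequestsPerDay: 10000,
  databaseMb: 50,
};

// Converts a daily total into an average per-second rate.
function perSecond(perDay: number): number {
  return perDay / SECONDS_PER_DAY;
}

const totals = {
  stateSyncsPerDay: perClient.stateSyncsPerDay * CLIENT_COUNT,   // 50,000
  apiRequestsPerDay: perClient.apiRequestsPerDay * CLIENT_COUNT, // 1,000,000
  databaseGb: (perClient.databaseMb * CLIENT_COUNT) / 1000,      // 5 GB
};

const averageSyncsPerSecond = perSecond(totals.stateSyncsPerDay);     // about 0.58
const averageRequestsPerSecond = perSecond(totals.apiRequestsPerDay); // about 11.6
```

These are averages over a full day. Real traffic concentrates around working hours and deployment windows, so peak rates will be several times higher, which is what the queue's burst capacity is sized for.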

Cost Estimates (GCP us-central1):

Service	Specification	Monthly Cost
GKE Cluster	10x n2-standard-8 nodes	$2,500
Cloud SQL (PostgreSQL)	db-custom-8-32GB (HA)	$800
Cloud Pub/Sub	100M messages/month	$400
Cloud Storage	100 GB (state backups)	$2
Secret Manager	1000 secrets x 10k accesses	$30
Load Balancer	10 TB ingress/egress	$300
Total		~$4,000/month

Cost per Client: ~$40/month (at 100 clients)

6. Dynamic Discovery & Onboarding

6.1 GitHub Repository Scanning

Discovery Flow:

Implementation:

// GitHub scanner service
class GitHubRepoScanner {
  private octokit: Octokit;
  private tenantId: string;

  async scanOrganization(orgName: string): Promise<DiscoveryResult[]> {
    const repos = await this.octokit.paginate(
      'GET /orgs/{org}/repos',
      { org: orgName, per_page: 100 }
    );

    const results: DiscoveryResult[] = [];

    for (const repo of repos) {
      try {
        // Fetch catalog-info.yaml from default branch
        const catalogInfo = await this.fetchCatalogInfo(repo);

        if (catalogInfo) {
          const workspaceId = this.extractWorkspaceId(catalogInfo);

          results.push({
            repository: repo.full_name,
            workspace_id: workspaceId,
            entity_ref: catalogInfo.metadata.name,
            discovered_at: new Date()
          });

          // Enqueue for processing
          await this.enqueueDiscovery(catalogInfo, workspaceId);
        }
      } catch (error) {
        console.error(`Failed to scan ${repo.full_name}:`, error);
      }
    }

    return results;
  }

  private async fetchCatalogInfo(repo: Repository): Promise<CatalogInfo | null> {
    try {
      const response = await this.octokit.rest.repos.getContent({
        owner: repo.owner.login,
        repo: repo.name,
        path: 'catalog-info.yaml',
        ref: repo.default_branch
      });

      // getContent returns an array for directories; only file responses carry content
      if (!Array.isArray(response.data) && 'content' in response.data) {
        const content = Buffer.from(response.data.content, 'base64').toString('utf8');
        return yaml.parse(content);
      }
      return null; // Path exists but is not a regular file
    } catch (error) {
      if ((error as { status?: number }).status === 404) {
        return null; // No catalog-info.yaml
      }
      throw error;
    }
  }

  private extractWorkspaceId(catalogInfo: CatalogInfo): string | null {
    // Normalize a missing annotation to null to match the declared return type
    return catalogInfo.metadata.annotations?.['terraform.io/workspace-id'] ?? null;
  }
}

6.2 Terraform Cloud Workspace Enumeration

Bulk Workspace Discovery:

// Terraform Cloud workspace scanner
class TerraformCloudScanner {
  private client: TerraformCloudClient;
  private rateLimiter: TerraformCloudRateLimiter;
  private tenantId: string;

  async enumerateWorkspaces(orgName: string): Promise<Workspace[]> {
    const workspaces: Workspace[] = [];
    let page: number | null = 1;

    while (page !== null) {
      // Rate limit protection: wait for capacity before each request
      await this.rateLimiter.waitForCapacity(this.tenantId);

      const response = await this.client.get(
        `/organizations/${orgName}/workspaces`,
        {
          params: {
            'page[size]': 100,
            'page[number]': page
          }
        }
      );

      workspaces.push(...response.data);

      // The Terraform Cloud API reports pagination with hyphenated keys;
      // next-page is null on the last page
      page = response.meta.pagination['next-page'];
    }

    return workspaces;
  }

  async linkToRepository(workspace: Workspace): Promise<string | null> {
    // Extract repository from workspace VCS settings
    if (workspace.attributes['vcs-repo']) {
      const vcsRepo = workspace.attributes['vcs-repo'];
      return `https://github.com/${vcsRepo.identifier}`;
    }
    return null;
  }
}

6.3 Automated Onboarding Workflow

New Business Unit Setup (< 5 minutes):

Repository Creation (Manual): DevOps team creates bu-{name}-infra repo in GitHub
catalog-info.yaml Template: CI/CD adds template during repo initialization
Terraform Cloud Workspace: Created automatically via TFC API
First Scan (Automated): Scheduled scanner picks up new repo within 5 minutes
State Sync (Automated): Catalog processor fetches state and creates entity
Backstage Visibility: Entity appears in catalog within 30 seconds

Onboarding Template (catalog-info.yaml):

apiVersion: backstage.io/v1alpha1
kind: System
metadata:
  name: {{ tenant_slug }}-{{ business_unit }}-infrastructure
  namespace: {{ tenant_slug }}
  description: Infrastructure for {{ business_unit }} business unit
  annotations:
    # Automatically linked by discovery
    terraform.io/workspace-id: "{{ workspace_id }}"
    terraform.io/organization: "{{ tenant_slug }}-{{ environment }}"
    github.com/project-slug: "{{ org }}/{{ repo }}"
    backstage.io/techdocs-ref: dir:.
spec:
  owner: {{ tenant_slug }}-{{ business_unit }}-team
  domain: infrastructure
  type: infrastructure
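
The document does not say which template engine CI/CD uses to fill in this file. As an illustration only, the `{{ name }}` placeholders could be rendered with a small substitution function like the one below; `renderTemplate` is a hypothetical helper, and failing on unknown variables is a design assumption.

```typescript
// Replaces every {{ name }} placeholder with the matching value.
// Throws on a missing variable so a half-rendered catalog-info.yaml is never committed.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(
    /\{\{\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*\}\}/g,
    (_match: string, name: string) => {
      if (!Object.prototype.hasOwnProperty.call(vars, name)) {
        throw new Error(`Missing template variable: ${name}`);
      }
      return vars[name];
    },
  );
}

// Usage with two of the template's variables:
const renderedName = renderTemplate(
  'name: {{ tenant_slug }}-{{ business_unit }}-infrastructure',
  { tenant_slug: 'acme-corp', business_unit: 'payments' },
);
// renderedName is "name: acme-corp-payments-infrastructure"
```

Values substituted into YAML should themselves be validated (for example against the naming rules for entity names), since a value containing a colon or newline could change the structure of the rendered file.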

7. Plugin Architecture Components

7.1 Backstage Plugin Structure

backstage-plugin-terraform-cloud/
├── backend/                          # Backend plugin
│   ├── src/
│   │   ├── service/
│   │   │   ├── router.ts            # API routes
│   │   │   ├── TerraformCloudClient.ts
│   │   │   ├── StateSanitizer.ts
│   │   │   └── CatalogSync.ts
│   │   ├── processors/
│   │   │   └── TerraformCloudEntityProcessor.ts
│   │   ├── providers/
│   │   │   └── TerraformCloudEntityProvider.ts
│   │   └── plugin.ts
│   └── package.json
│
├── frontend/                         # Frontend plugin
│   ├── src/
│   │   ├── components/
│   │   │   ├── TerraformWorkspaceCard/
│   │   │   ├── StateResourcesTable/
│   │   │   └── WorkspaceRunsTimeline/
│   │   ├── routes.ts
│   │   └── plugin.ts
│   └── package.json
│
├── common/                           # Shared types
│   ├── src/
│   │   ├── types.ts
│   │   └── api.ts
│   └── package.json
│
└── docs/
    ├── setup.md
    └── configuration.md

7.2 Backend Plugin (Catalog Processor)

Core Responsibilities:

Fetch Terraform Cloud state
Sanitize sensitive data
Transform state to Backstage entities
Handle webhook events

Implementation:

// backend/src/processors/TerraformCloudEntityProcessor.ts
import { CatalogProcessor, CatalogProcessorEmit } from '@backstage/plugin-catalog-node';
import { Entity } from '@backstage/catalog-model';
import { TerraformCloudClient } from '../service/TerraformCloudClient';
import { StateSanitizer } from '../service/StateSanitizer';

export class TerraformCloudEntityProcessor implements CatalogProcessor {
  private client: TerraformCloudClient;
  private sanitizer: StateSanitizer;

  constructor(
    client: TerraformCloudClient,
    sanitizer: StateSanitizer
  ) {
    this.client = client;
    this.sanitizer = sanitizer;
  }

  getProcessorName(): string {
    return 'TerraformCloudEntityProcessor';
  }

  async postProcessEntity(
    entity: Entity,
    _location: any,
    emit: CatalogProcessorEmit
  ): Promise<Entity> {
    // Check if entity has Terraform Cloud annotations
    const workspaceId = entity.metadata.annotations?.['terraform.io/workspace-id'];

    if (!workspaceId) {
      return entity; // Not a Terraform-managed entity
    }

    try {
      // Fetch latest state from Terraform Cloud
      const state = await this.client.fetchWorkspaceState(workspaceId);

      // Sanitize state; sanitize() returns { sanitized, violations, safe }
      const result = await this.sanitizer.sanitize(state);
      const resources = result.sanitized.resources;

      // Extract resources and create child entities
      for (const resource of resources) {
        const childEntity = this.createResourceEntity(entity, resource);
        // Emitted entities must carry the location they were derived from
        emit({ type: 'entity', entity: childEntity, location: _location });
      }

      // Enrich parent entity with state metadata (annotation values must be strings)
      entity.metadata.annotations = {
        ...entity.metadata.annotations,
        'terraform.io/state-version': String(state.version),
        'terraform.io/last-updated': state.updated_at,
        'terraform.io/resource-count': resources.length.toString()
      };

      return entity;
    } catch (error) {
      console.error(`Failed to process Terraform entity ${entity.metadata.name}:`, error);
      return entity; // Return original entity on error
    }
  }

  private createResourceEntity(parent: Entity, resource: any): Entity {
    return {
      apiVersion: 'backstage.io/v1alpha1',
      kind: 'Resource',
      metadata: {
        name: `${parent.metadata.name}-${resource.type}-${resource.name}`,
        namespace: parent.metadata.namespace,
        annotations: {
          'terraform.io/resource-address': resource.address,
          'terraform.io/resource-type': resource.type,
          'backstage.io/managed-by-location': `terraform:${parent.metadata.annotations?.['terraform.io/workspace-id']}`
        }
      },
      spec: {
        type: resource.type,
        owner: parent.spec?.owner || 'unknown',
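        // Build the ref from the parent's actual kind and namespace (it may be a System, not a Component)
        dependsOn: [`${parent.kind.toLowerCase()}:${parent.metadata.namespace ?? 'default'}/${parent.metadata.name}`],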
        ...resource.values
      }
    };
  }
}

7.3 Entity Provider (Real-Time Sync)

Webhook-Driven Updates:

// backend/src/providers/TerraformCloudEntityProvider.ts
import { EntityProvider, EntityProviderConnection } from '@backstage/plugin-catalog-node';
import { TerraformCloudClient } from '../service/TerraformCloudClient';

export class TerraformCloudEntityProvider implements EntityProvider {
  private connection?: EntityProviderConnection;
  private client: TerraformCloudClient;

  constructor(client: TerraformCloudClient) {
    this.client = client;
  }

  getProviderName(): string {
    return 'TerraformCloudEntityProvider';
  }

  async connect(connection: EntityProviderConnection): Promise<void> {
    this.connection = connection;

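    // Start periodic full sync (every 1 hour as backup); log failures instead of
    // leaving an unhandled promise rejection
    setInterval(() => {
      this.fullSync().catch(error => console.error('Periodic full sync failed:', error));
    }, 3600000);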

    // Initial sync on startup
    await this.fullSync();
  }

  async fullSync(): Promise<void> {
    if (!this.connection) return;

    console.log('Starting full Terraform Cloud sync...');

    // Fetch all entities from database (already processed)
    const entities = await this.fetchAllEntities();

    // Apply entities to catalog
    await this.connection.applyMutation({
      type: 'full',
      entities: entities.map(entity => ({
        entity,
        locationKey: `terraform:${entity.metadata.annotations?.['terraform.io/workspace-id']}`
      }))
    });

    console.log(`Full sync completed: ${entities.length} entities`);
  }

  async handleWebhookEvent(event: TerraformWebhookEvent): Promise<void> {
    if (!this.connection) return;

    // Fetch updated entity
    const entity = await this.client.fetchEntityForWorkspace(event.workspace_id);

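    // Apply delta update to catalog (delta entries are deferred entities with a
    // location key, matching the key format used by the full sync)
    await this.connection.applyMutation({
      type: 'delta',
      added: [{ entity, locationKey: `terraform:${event.workspace_id}` }],
      removed: []
    });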

    console.log(`Webhook sync completed for workspace ${event.workspace_id}`);
  }
}

7.4 Frontend Plugin (UI Components)

Workspace Details Card:

// frontend/src/components/TerraformWorkspaceCard/TerraformWorkspaceCard.tsx
import React from 'react';
import { Entity } from '@backstage/catalog-model';
import { InfoCard, Link } from '@backstage/core-components';
import { useApi } from '@backstage/core-plugin-api';
import { terraformCloudApiRef } from '../../api';

export const TerraformWorkspaceCard = ({ entity }: { entity: Entity }) => {
  const api = useApi(terraformCloudApiRef);
  const workspaceId = entity.metadata.annotations?.['terraform.io/workspace-id'];

  const [workspace, setWorkspace] = React.useState<any>(null);
  const [loading, setLoading] = React.useState(true);

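  React.useEffect(() => {
    if (!workspaceId) {
      setLoading(false); // No annotation: stop loading so the card renders nothing
      return;
    }
    api.getWorkspace(workspaceId)
      .then(setWorkspace)
      .catch(() => setWorkspace(null)) // Treat fetch errors as "no workspace"
      .finally(() => setLoading(false));
  }, [workspaceId, api]);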

  if (loading) return <InfoCard title="Terraform Workspace">Loading...</InfoCard>;
  if (!workspace) return null;

  return (
    <InfoCard title="Terraform Workspace">
      <div>
        <strong>Organization:</strong> {workspace.organization} <br />
        <strong>Workspace:</strong> <Link to={workspace.url}>{workspace.name}</Link> <br />
        <strong>State Version:</strong> {workspace.currentStateVersion} <br />
        <strong>Last Updated:</strong> {new Date(workspace.updatedAt).toLocaleString()} <br />
        <strong>Resource Count:</strong> {workspace.resourceCount} <br />
      </div>
    </InfoCard>
  );
};

7.5 Configuration Schema

app-config.yaml Structure:

# Multi-tenant Terraform Cloud plugin configuration
terraformCloud:
  # Enable/disable plugin
  enabled: true

  # Tenant-specific configurations
  tenants:
    - id: acme-corp
      slug: acme-corp
      organization: acme-corp-prod
      auth:
        tokenSecretRef: projects/my-project/secrets/tfc-acme-corp-token

      sanitization:
        rulesConfigPath: gs://backstage-config/acme-corp/sanitization-rules.yaml
        allowlistPath: gs://backstage-config/acme-corp/allowlist.yaml

      discovery:
        github:
          enabled: true
          organizations:
            - acme-corp
          scanIntervalMinutes: 5

        terraformCloud:
          enabled: true
          webhooksEnabled: true

    - id: globex-inc
      slug: globex-inc
      organization: globex-production
      auth:
        tokenSecretRef: projects/my-project/secrets/tfc-globex-token

      sanitization:
        rulesConfigPath: gs://backstage-config/globex-inc/sanitization-rules.yaml

      discovery:
        github:
          enabled: true
          organizations:
            - globex-inc
          scanIntervalMinutes: 10

  # Global rate limiting
  rateLimiting:
    enabled: true
    requestsPerSecond: 30
    burstSize: 50

  # Webhook server configuration
  webhooks:
    enabled: true
    port: 8080
    path: /api/webhooks/terraform-cloud
    signatureHeader: X-TFE-Notification-Signature
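
A configuration of this shape can be checked at startup before any tenant is synced. The following is a minimal sketch rather than the plugin's actual loader: the `TenantConfig` type and `validateTenants` function are hypothetical names, and the checks cover only a few invariants implied by the YAML above (unique tenant ids, a token reference per tenant, a positive scan interval).

```typescript
// Hypothetical types mirroring a subset of the app-config.yaml structure above.
interface TenantConfig {
  id: string;
  organization: string;
  auth: { tokenSecretRef: string };
  discovery?: { github?: { enabled: boolean; scanIntervalMinutes: number } };
}

// Returns a list of human-readable problems; an empty list means the config passed.
function validateTenants(tenants: TenantConfig[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();

  for (const tenant of tenants) {
    if (seen.has(tenant.id)) {
      problems.push(`duplicate tenant id: ${tenant.id}`);
    }
    seen.add(tenant.id);

    if (!tenant.auth.tokenSecretRef.trim()) {
      problems.push(`tenant ${tenant.id}: missing auth.tokenSecretRef`);
    }

    const github = tenant.discovery?.github;
    if (github?.enabled && github.scanIntervalMinutes <= 0) {
      problems.push(`tenant ${tenant.id}: scanIntervalMinutes must be positive`);
    }
  }
  return problems;
}
```

Failing fast on a non-empty result keeps a misconfigured tenant from silently sharing another tenant's id.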

8. Security & Compliance Architecture

8.1 Security Boundaries

8.2 Zero Trust Architecture

Principles:

Never Trust, Always Verify: Every request authenticated and authorized
Least Privilege Access: Minimal permissions per component
Assume Breach: Defense in depth with multiple layers
Encrypt Everything: Data at rest, in transit, and in use

Implementation:

// Zero Trust middleware stack
app.use([
  // 1. Validate request integrity
  validateRequestSignature,

  // 2. Authenticate request (JWT, API key, or mTLS)
  authenticate,

  // 3. Extract tenant context
  extractTenantContext,

  // 4. Authorize action against RBAC policies
  authorize,

  // 5. Inject tenant context into database session
  injectTenantContext,

  // 6. Audit log all requests
  auditLog,

  // 7. Rate limit per tenant
  rateLimitByTenant
]);

// Example: RBAC policy
interface RBACPolicy {
  tenant_id: string;
  user_id: string;
  permissions: {
    action: 'read' | 'write' | 'admin';
    resource: 'catalog' | 'config' | 'logs';
    scope: string; // e.g., "namespace:acme-corp-dev"
  }[];
}

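async function authorize(req: Request, res: Response, next: NextFunction) {
  // req.context is populated by extractTenantContext earlier in the stack
  const { tenantId, userId, action, resource } = req.context;

  try {
    const hasPermission = await rbac.check({
      tenant_id: tenantId,
      user_id: userId,
      action: action,
      resource: resource
    });

    if (!hasPermission) {
      return res.status(403).json({
        error: 'Insufficient permissions',
        required: `${action}:${resource}`
      });
    }

    next();
  } catch (error) {
    // Express 4 does not catch rejected promises from async middleware; forward explicitly
    next(error);
  }
}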
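
The `rbac.check` call above is not defined in this document. One way it might evaluate a policy is sketched below as a pure function. The `Policy`, `Permission`, and `isAllowed` names are hypothetical, and two conventions are assumptions made for illustration: `admin` implies `write`, which implies `read`, and a scope of `*` matches any scope.

```typescript
type Action = 'read' | 'write' | 'admin';
type Resource = 'catalog' | 'config' | 'logs';

interface Permission {
  action: Action;
  resource: Resource;
  scope: string; // e.g., "namespace:acme-corp-dev", or "*" for any scope (assumed convention)
}

interface Policy {
  tenant_id: string;
  user_id: string;
  permissions: Permission[];
}

// Assumed ordering: admin implies write, write implies read.
const ACTION_RANK: Record<Action, number> = { read: 0, write: 1, admin: 2 };

function isAllowed(
  policy: Policy,
  request: { tenant_id: string; user_id: string; action: Action; resource: Resource; scope: string },
): boolean {
  // A policy never grants anything outside its own tenant or user.
  if (policy.tenant_id !== request.tenant_id || policy.user_id !== request.user_id) {
    return false;
  }
  return policy.permissions.some(
    p =>
      p.resource === request.resource &&
      ACTION_RANK[p.action] >= ACTION_RANK[request.action] &&
      (p.scope === '*' || p.scope === request.scope),
  );
}
```

Checking the tenant before any permission is examined means a lookup bug that returns another tenant's policy still results in a denial.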

8.3 Encryption Strategy

Data at Rest:

Database: Cloud SQL for PostgreSQL encryption at rest with customer-managed encryption keys (CMEK)
Backups: AES-256 encrypted via Cloud Storage CMEK
Secrets: Google Secret Manager with automatic rotation

Data in Transit:

External: TLS 1.3 with perfect forward secrecy
Internal: mTLS between all services (service mesh)
Terraform Cloud API: TLS 1.3 + API token in headers

Data in Use:

Memory Encryption: Confidential VMs (AMD SEV-SNP) for sensitive workloads
Sanitization: In-memory processing, no disk writes for unredacted state

8.4 Compliance & Audit

SOC 2 Type II Controls:

Control	Implementation	Evidence
CC6.1 - Logical Access	JWT authentication + RBAC	Access logs, auth events
CC6.7 - Transmission Integrity	TLS 1.3 + certificate pinning	TLS handshake logs
CC6.7 - Transmission Confidentiality	End-to-end encryption	Encryption audit logs
CC7.2 - System Monitoring	Prometheus + Grafana + alerts	Metrics dashboards
CC8.1 - Change Management	GitOps + peer review	Git commit history

Audit Logging:

-- Comprehensive audit table
CREATE TABLE security_audit_log (
  log_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(), -- TIMESTAMPTZ records an unambiguous instant

  -- Request context
  tenant_id UUID NOT NULL,
  user_id VARCHAR(255),
  ip_address INET,
  user_agent TEXT,

  -- Action details
  action VARCHAR(100) NOT NULL, -- e.g., 'catalog:read', 'config:write'
  resource VARCHAR(500) NOT NULL,
  resource_id VARCHAR(255),

  -- Outcome
  status INTEGER NOT NULL, -- HTTP status code
  error_message TEXT,

  -- For forensics
  request_id UUID NOT NULL,
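  session_id UUID
);

-- PostgreSQL does not accept inline INDEX clauses in CREATE TABLE; create indexes separately
CREATE INDEX idx_audit_tenant_time ON security_audit_log (tenant_id, timestamp DESC);
CREATE INDEX idx_audit_action ON security_audit_log (action, timestamp DESC);
CREATE INDEX idx_audit_user ON security_audit_log (user_id, timestamp DESC);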

-- Example query: All failed authorization attempts in last 24h
SELECT
  timestamp,
  tenant_id,
  user_id,
  action,
  resource
FROM security_audit_log
WHERE status = 403
  AND timestamp > NOW() - INTERVAL '24 hours'
ORDER BY timestamp DESC;

9. Technology Stack Recommendations

9.1 Core Technologies

Layer	Technology	Justification	Alternatives
Backend Runtime	Node.js 20 LTS	Backstage native, async I/O	Deno (future)
Backend Framework	Express.js	Backstage plugin API compatibility	Fastify (performance)
Frontend	React 18	Backstage UI framework	N/A
Database	PostgreSQL 15	Backstage catalog backend, RLS support	N/A
Cache	Redis 7	Low-latency caching, pub/sub	Memcached
Message Queue	Google Cloud Pub/Sub	Managed, scalable, at-least-once delivery	RabbitMQ, Kafka
Container Orchestration	GKE (Kubernetes 1.28+)	Managed, auto-scaling, GCP integration	EKS, AKS
Service Mesh	Istio 1.20	mTLS, traffic management, observability	Linkerd
Secret Management	Google Secret Manager	Managed, automatic rotation, IAM integration	HashiCorp Vault
Observability	Prometheus + Grafana	Industry standard, rich ecosystem	Datadog, New Relic
CI/CD	GitHub Actions	Native GitHub integration	GitLab CI, Cloud Build

9.2 Terraform Cloud SDK

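REST API Client:

// HashiCorp maintains go-tfe for Go; this example assumes no Node.js SDK
// and calls the documented Terraform Cloud REST API (v2) directly.
const TFC_API = 'https://app.terraform.io/api/v2';
const token = await secretManager.getSecret('tfc-token');
const headers = {
  Authorization: `Bearer ${token}`,
  'Content-Type': 'application/vnd.api+json'
};

async function tfcGet(path: string): Promise<any> {
  const res = await fetch(`${TFC_API}${path}`, { headers });
  if (!res.ok) throw new Error(`Terraform Cloud API ${path} failed: ${res.status}`);
  return res.json();
}

// Fetch workspace by organization and name
const workspace = await tfcGet('/organizations/acme-corp-prod/workspaces/my-workspace');

// Fetch latest state version
const stateVersion = await tfcGet(`/workspaces/${workspace.data.id}/current-state-version`);
const stateDownloadUrl = stateVersion.data.attributes['hosted-state-download-url'];

// Download state JSON (send the bearer token with the download request as well)
const stateRes = await fetch(stateDownloadUrl, { headers: { Authorization: `Bearer ${token}` } });
if (!stateRes.ok) throw new Error(`State download failed: ${stateRes.status}`);
const state = await stateRes.json();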

9.3 Backstage Catalog Extensions

Custom Entity Kinds:

# Define Terraform-specific entity kinds
apiVersion: backstage.io/v1alpha1
kind: TerraformWorkspace
metadata:
  name: acme-corp-prod-vpc
  namespace: acme-corp
  annotations:
    terraform.io/workspace-id: ws-abc123
    terraform.io/organization: acme-corp-prod
spec:
  type: infrastructure
  lifecycle: production
  owner: platform-team

  # Terraform-specific fields
  terraform:
    version: "1.6.0"
    provider_versions:
      google: "5.0.0"
      kubernetes: "2.23.0"

    locked: false
    auto_apply: false

    vcs:
      repository: "acme-corp/infrastructure-vpc"
      branch: "main"

    variables:
      - key: project_id
        sensitive: false
      - key: api_key
        sensitive: true # Indicates sanitized

9.4 Monitoring & Alerting Stack

Metrics Collection:

# Prometheus ServiceMonitor for plugin
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backstage-terraform-plugin
spec:
  selector:
    matchLabels:
      app: backstage
      component: terraform-plugin
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s

Key Metrics:

// Custom Prometheus metrics
import { Counter, Histogram, Gauge } from 'prom-client';

const metrics = {
  // Counter: Total API requests
  apiRequests: new Counter({
    name: 'terraform_plugin_api_requests_total',
    help: 'Total API requests',
    labelNames: ['tenant_id', 'method', 'endpoint', 'status']
  }),

  // Histogram: API latency
  apiLatency: new Histogram({
    name: 'terraform_plugin_api_latency_seconds',
    help: 'API request latency',
    labelNames: ['tenant_id', 'endpoint'],
    buckets: [0.1, 0.5, 1, 2, 5, 10]
  }),

  // Gauge: Active workspaces per tenant
  activeWorkspaces: new Gauge({
    name: 'terraform_plugin_active_workspaces',
    help: 'Number of active workspaces',
    labelNames: ['tenant_id']
  }),

  // Counter: Sanitization violations
  sanitizationViolations: new Counter({
    name: 'terraform_plugin_sanitization_violations_total',
    help: 'Total sanitization violations detected',
    labelNames: ['tenant_id', 'rule_id', 'severity']
  })
};

Grafana Dashboards:

Tenant Overview: Workspaces, entities, API usage per tenant
Performance: Latency percentiles, error rates, queue depth
Security: Sanitization violations, auth failures, rate limit hits
Cost: API costs, database size, compute usage per tenant

10. Deployment Architecture

10.1 Kubernetes Deployment Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backstage-terraform-plugin
  namespace: backstage
spec:
  replicas: 3
  selector:
    matchLabels:
      app: backstage
      component: terraform-plugin
  template:
    metadata:
      labels:
        app: backstage
        component: terraform-plugin
    spec:
      serviceAccountName: backstage-terraform-sa

      # Multi-container pod
      containers:
        # Main application
        - name: plugin
          image: gcr.io/my-project/backstage-terraform-plugin:v1.0.0
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics

          env:
            - name: NODE_ENV
              value: production
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: connection-string
            - name: REDIS_URL
              value: redis://redis-service:6379

          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2000m
              memory: 4Gi

          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10

          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5

        # Cloud SQL Proxy sidecar
        - name: cloud-sql-proxy
          # Pin a specific proxy version in production; :latest makes rollouts non-reproducible
          image: gcr.io/cloudsql-docker/gce-proxy:latest
          command:
            - "/cloud_sql_proxy"
            - "-instances=my-project:us-central1:backstage-db=tcp:5432"

          securityContext:
            runAsNonRoot: true

      # Security context
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000

10.2 Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backstage-terraform-plugin-hpa
  namespace: backstage
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backstage-terraform-plugin
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

    # Custom metric: Queue depth
    - type: External
      external:
        metric:
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              resource.labels.subscription_id: terraform-state-updates
        target:
          type: AverageValue
          averageValue: "500"

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 50  # Scale down by 50% at most
          periodSeconds: 60

    scaleUp:
      stabilizationWindowSeconds: 0 # Scale up immediately
      policies:
        - type: Percent
          value: 100  # Double capacity at most
          periodSeconds: 15
        - type: Pods
          value: 4   # Add 4 pods at most
          periodSeconds: 15
      selectPolicy: Max # Take the policy that scales fastest
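
To make the scaling behavior of this manifest concrete, the sketch below works through the replica calculation. It uses the rule given in the Kubernetes HPA documentation, `desired = ceil(currentReplicas * currentValue / targetValue)`, and then applies the scale-up policies and replica bounds from the manifest. The function names are hypothetical, and the sketch is a simplification: it ignores the controller's tolerance band, stabilization windows, and the scale-down policy.

```typescript
// Core HPA rule: desired = ceil(currentReplicas * currentValue / targetValue).
function desiredForRatio(currentReplicas: number, currentValue: number, targetValue: number): number {
  return Math.ceil(currentReplicas * (currentValue / targetValue));
}

// For an AverageValue target, the total metric is divided across replicas.
function desiredForAverageValue(totalValue: number, targetPerReplica: number): number {
  return Math.ceil(totalValue / targetPerReplica);
}

// Applies the manifest's scale-up policies (100% or 4 pods per period, selectPolicy: Max)
// and the minReplicas/maxReplicas bounds. Scale-down limits are not modeled.
function nextReplicas(current: number, proposals: number[], min = 3, max = 50): number {
  const wanted = Math.max(...proposals); // the largest proposal across metrics wins
  const scaleUpCap = current + Math.max(current, 4); // +100% vs +4 pods, whichever allows more
  const limited = wanted > current ? Math.min(wanted, scaleUpCap) : wanted;
  return Math.min(max, Math.max(min, limited));
}

// Example: 3 replicas at 90% CPU against a 70% target, with 5,000 undelivered messages.
const cpuProposal = desiredForRatio(3, 90, 70);
const queueProposal = desiredForAverageValue(5000, 500);
const firstStep = nextReplicas(3, [cpuProposal, queueProposal]);
```

In this example the queue metric asks for 10 replicas, but the first 15-second period can add at most 4 pods to the current 3, so the deployment reaches 10 only in the following period.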

11. Architecture Decision Records (ADRs)

ADR-001: Row-Level Security for Tenant Isolation

Status: Accepted Date: 2024-11-13

Context: Need to isolate tenant data in a cost-effective, scalable way while maintaining strong security boundaries.

Decision: Implement PostgreSQL Row-Level Security (RLS) with tenant discriminator column.

Rationale:

Security: Database-enforced isolation (not application-layer)
Cost: Single database instance for all tenants (vs. separate DBs)
Performance: Indexed tenant_id column, partitioning by tenant
Compliance: Meets SOC 2 data isolation requirements

Consequences:

Positive: Lower operational overhead, simpler backups
Negative: RLS policy complexity, PostgreSQL version dependency (9.5+)
Mitigation: Extensive testing of RLS policies, monitoring for leaks
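
For RLS to isolate tenants, every query must run with the tenant bound to the database session. The sketch below shows one way to do that per transaction. It is an illustration under stated assumptions: `withTenant` and `Queryable` are hypothetical names, and `app.current_tenant` is an assumed setting name that an RLS policy would read through `current_setting('app.current_tenant')`. The `Queryable` shape is intended to be compatible with the `query` method of common PostgreSQL clients, which is not verified here.

```typescript
// Minimal shape of a database client used by this sketch.
interface Queryable {
  query(text: string, values?: unknown[]): Promise<unknown>;
}

// Runs `work` inside a transaction whose tenant setting is bound to tenantId.
async function withTenant<T>(
  client: Queryable,
  tenantId: string,
  work: (client: Queryable) => Promise<T>,
): Promise<T> {
  await client.query('BEGIN');
  try {
    // set_config(name, value, is_local = true) scopes the value to this transaction,
    // so a pooled connection cannot carry one tenant's context into the next request.
    await client.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query('COMMIT');
    return result;
  } catch (error) {
    await client.query('ROLLBACK');
    throw error;
  }
}
```

Passing the tenant id as a bound parameter rather than interpolating it into the statement avoids SQL injection through the tenant identifier.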

ADR-002: Pub/Sub for Asynchronous Processing

Status: Accepted Date: 2024-11-13

Context: Need to handle burst traffic (1000+ workspace updates/minute) without blocking API requests.

Decision: Use Google Cloud Pub/Sub for message queuing between API and workers.

Rationale:

Scalability: Managed service, auto-scaling to millions of messages
Reliability: At-least-once delivery, dead letter queues
Integration: Native GCP integration, IAM-based auth

Consequences:

Positive: Decoupled architecture, elastic capacity
Negative: Potential message duplication (at-least-once), GCP vendor lock-in
Mitigation: Idempotent message handlers, multi-cloud strategy (future)
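
Because delivery is at-least-once, a handler may see the same message twice. The sketch below shows one possible shape for an idempotent handler wrapper. It is a minimal in-memory illustration with hypothetical names (`ProcessedMessages`, `idempotent`); with several worker replicas the set of processed ids would need to live in a shared store such as the Redis cache listed in the technology stack.

```typescript
// Tracks recently processed message ids so redelivered messages can be skipped.
class ProcessedMessages {
  private readonly seen = new Map<string, number>(); // message id -> expiry (ms since epoch)
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  has(messageId: string): boolean {
    const expiry = this.seen.get(messageId);
    if (expiry === undefined) return false;
    if (expiry <= this.now()) {
      this.seen.delete(messageId); // expired entries count as unseen
      return false;
    }
    return true;
  }

  mark(messageId: string): void {
    this.seen.set(messageId, this.now() + this.ttlMs);
  }
}

// Wraps a handler so it runs at most once per message id within the TTL window.
function idempotent<T>(
  tracker: ProcessedMessages,
  handler: (payload: T) => void,
): (messageId: string, payload: T) => boolean {
  return (messageId, payload) => {
    if (tracker.has(messageId)) return false; // duplicate delivery, skip
    handler(payload); // if this throws, the id is not recorded and redelivery retries
    tracker.mark(messageId);
    return true;
  };
}
```

Recording the id only after the handler succeeds keeps the retry guarantee of at-least-once delivery intact.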

ADR-003: Real-Time Sync via Terraform Cloud Webhooks

Status: Accepted Date: 2024-11-13

Context: Users expect catalog updates within 5 minutes of Terraform apply.

Decision: Subscribe to Terraform Cloud webhooks for run completion events.

Rationale:

Latency: Near-real-time updates (< 30 seconds) vs. polling (5-15 minutes)
Efficiency: Event-driven reduces unnecessary API calls (rate limits)
Cost: Lower Terraform Cloud API usage

Consequences:

Positive: Faster catalog updates, better user experience
Negative: Webhook reliability dependency, HMAC signature verification overhead
Mitigation: Fallback polling every hour (a missed webhook can exceed the 5-minute target until the next poll), webhook retry logic
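
The signature check mentioned above can be sketched as follows. Terraform Cloud's notification documentation describes the `X-TFE-Notification-Signature` header as a hex-encoded HMAC-SHA512 of the request body, keyed with the token set on the notification configuration. The function names are hypothetical, and the code assumes the caller has access to the raw, unparsed request body.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Computes the hex HMAC-SHA512 of the raw request body using the notification token.
function signBody(rawBody: string, token: string): string {
  return createHmac('sha512', token).update(rawBody, 'utf8').digest('hex');
}

// Compares the expected signature with the header value in constant time.
function isValidSignature(rawBody: string, headerValue: string | undefined, token: string): boolean {
  if (!headerValue) return false;
  const expected = Buffer.from(signBody(rawBody, token), 'hex');
  const received = Buffer.from(headerValue, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The signature must be computed over the bytes as received; verifying a re-serialized JSON body can fail even for legitimate requests.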

ADR-004: In-Memory State Sanitization

Status: Accepted Date: 2024-11-13

Context: Terraform state contains sensitive data (credentials, IPs) that must not reach Backstage catalog.

Decision: Sanitize state in-memory before database persistence, never write unredacted state to disk.

Rationale:

Security: Reduces attack surface (no plaintext state on disk)
Compliance: GDPR/CCPA compliance (no persistent PII)
Performance: In-memory processing faster than disk I/O

Consequences:

Positive: Stronger security posture, faster processing
Negative: Higher memory requirements, complex sanitization logic
Mitigation: Worker memory limits (4GB), sanitization rule versioning
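
As an illustration of the decision above, the sketch below redacts sensitive attributes from a parsed state object entirely in memory. It is not the sanitization engine itself: the key pattern, the `[REDACTED]` marker, and the `sanitize` function are hypothetical, and real rules would come from the per-tenant sanitization configuration.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical key patterns used only for this example.
const SENSITIVE_KEY = /(password|secret|token|api_key|private_key)/i;
const REDACTED = '[REDACTED]';

// Returns a redacted deep copy; the input is never mutated or written anywhere.
function sanitize(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map(item => sanitize(item));
  }
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value)) {
      out[key] = SENSITIVE_KEY.test(key) ? REDACTED : sanitize(value[key]);
    }
    return out;
  }
  return value; // primitives under non-sensitive keys pass through unchanged
}
```

Key-based matching alone does not catch secrets stored under innocuous names, which is why the risk table pairs it with value-based detection and a manual review queue.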

12. Risk Analysis & Mitigation

12.1 High-Severity Risks

Risk	Probability	Impact	Mitigation
Cross-Tenant Data Leak	Low	Critical	RLS policies, extensive testing, audit logging, automated leak detection
Terraform Cloud Rate Limit	Medium	High	Per-tenant quotas, token bucket algorithm, caching, batch processing
Webhook Replay Attack	Low	Medium	HMAC signature verification, timestamp validation, nonce tracking
Database Outage	Low	High	High availability (Cloud SQL HA), automatic failover, connection pooling
PII Exposure in Catalog	Low	Critical	Multi-layer sanitization, regex + ML detection, audit trail, manual review queue

12.2 Mitigation Strategies

Cross-Tenant Data Leak Prevention:

-- Automated leak detection query (run hourly)
SELECT
  entity_ref,
  tenant_id,
  COUNT(*) OVER (PARTITION BY entity_ref) as duplicate_count
FROM catalog_entities
WHERE entity_ref IN (
  SELECT entity_ref
  FROM catalog_entities
  GROUP BY entity_ref
  HAVING COUNT(DISTINCT tenant_id) > 1
);
-- Alert if any results (entity visible to multiple tenants)

Rate Limit Handling:

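// Exponential backoff with jitter
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries: number = 5
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Caught values are untyped; read the HTTP status defensively
      const status = (error as { status?: number } | undefined)?.status;
      if (status === 429 && attempt < maxRetries - 1) {
        const backoff = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, 8s with the default of 5 attempts
        const jitter = Math.random() * 1000; // Spread retries so clients do not retry in lockstep
        await sleep(backoff + jitter);
      } else {
        throw error;
      }
    }
  }
  throw new Error('Max retries exceeded');
}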

13. Scalability Projections

13.1 Growth Model (10 Clients → 100 Clients)

Metric	10 Clients	50 Clients	100 Clients	Notes
Workspaces	2,000	10,000	20,000	Linear growth
Entities	10,000	50,000	100,000	5 entities/workspace
Daily State Syncs	5,000	25,000	50,000	~2.5 syncs/workspace/day
Database Size	500 MB	2.5 GB	5 GB	Compressed JSONB
API Requests/Day	100K	500K	1M	10K req/client/day
Queue Messages/Day	10K	50K	100K	1K msg/client/day
GKE Nodes	3	6	10	n2-standard-8
Monthly Cost	$1,200	$2,500	$4,000	$40/client at scale
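
The per-client ratios behind this table can be recomputed directly from its rows, which is a quick way to keep the notes column honest when the figures are revised. The sketch below uses hypothetical names (`GrowthPoint`, `perClient`) and copies only four of the table's metrics.

```typescript
interface GrowthPoint {
  clients: number;
  workspaces: number;
  entities: number;
  monthlyCostUsd: number;
}

// Figures copied from the growth model table.
const growthModel: GrowthPoint[] = [
  { clients: 10, workspaces: 2_000, entities: 10_000, monthlyCostUsd: 1_200 },
  { clients: 50, workspaces: 10_000, entities: 50_000, monthlyCostUsd: 2_500 },
  { clients: 100, workspaces: 20_000, entities: 100_000, monthlyCostUsd: 4_000 },
];

// Derives the per-client and per-workspace ratios for one column of the table.
function perClient(point: GrowthPoint) {
  return {
    clients: point.clients,
    workspacesPerClient: point.workspaces / point.clients,
    entitiesPerWorkspace: point.entities / point.workspaces,
    costPerClientUsd: point.monthlyCostUsd / point.clients,
  };
}

const ratios = growthModel.map(perClient);
```

Workspaces per client (200) and entities per workspace (5) stay constant across the three columns, while cost per client falls from $120 to $50 to $40 as the shared infrastructure is spread over more tenants.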

13.2 Breaking Points & Solutions

Database Query Performance (100K+ entities):

Problem: Full table scans slow down at 100K+ rows
Solution: Partitioning by tenant_id (10 partitions), covering indexes
Target: < 100ms p95 query latency

Terraform Cloud Rate Limits (30 req/sec shared):

Problem: 100 clients competing for 30 req/sec quota
Solution: Per-tenant quotas (0.3 req/sec each), intelligent caching (1-hour TTL)
Target: < 5% rate limit rejections
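
A per-tenant quota of this kind is commonly enforced with a token bucket, the algorithm named in the risk table. The sketch below is a minimal single-process illustration with hypothetical names (`TokenBucket`, `createTenantBuckets`); it divides the shared rate evenly across tenants, which is an assumption, and a multi-replica deployment would need to keep bucket state in a shared store.

```typescript
// Token bucket: refills continuously at ratePerSec up to capacity; each request costs one token.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly ratePerSec: number;
  private readonly capacity: number;
  private readonly now: () => number;

  constructor(ratePerSec: number, capacity: number, now: () => number = Date.now) {
    this.ratePerSec = ratePerSec;
    this.capacity = capacity;
    this.now = now;
    this.tokens = capacity; // start full so a quiet tenant can burst
    this.lastRefillMs = now();
  }

  tryAcquire(): boolean {
    const current = this.now();
    const elapsedSec = (current - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefillMs = current;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// One bucket per tenant; 30 req/sec shared evenly by 100 tenants gives 0.3 req/sec each.
function createTenantBuckets(
  tenantIds: string[],
  globalRatePerSec: number,
  burst: number,
  now?: () => number,
): Map<string, TokenBucket> {
  const perTenantRate = globalRatePerSec / tenantIds.length;
  const buckets = new Map<string, TokenBucket>();
  for (const id of tenantIds) {
    buckets.set(id, new TokenBucket(perTenantRate, burst, now));
  }
  return buckets;
}
```

At 0.3 req/sec a tenant earns one request roughly every 3.3 seconds, so the burst capacity largely determines how quickly a full workspace sync can start.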

Memory Pressure (sanitization workload):

Problem: In-memory sanitization requires 100-500MB per state
Solution: Worker pods with 4GB memory, queue-based distribution, streaming JSON parsing
Target: < 2GB memory per worker at 80% utilization

14. Implementation Roadmap

Phase 1: Foundation (Weeks 1-4)

Set up multi-tenant PostgreSQL database with RLS
Implement Terraform Cloud API client with rate limiting
Build basic state sanitization engine
Deploy backend plugin to GKE (single tenant PoC)

Phase 2: Multi-Tenant Core (Weeks 5-8)

Implement tenant context middleware
Build API key authentication system
Add per-tenant sanitization rules
Deploy Pub/Sub message queue
Implement catalog entity processor

Phase 3: Dynamic Discovery (Weeks 9-12)

Build GitHub repository scanner
Implement Terraform Cloud workspace enumeration
Add webhook event handling
Deploy automated onboarding workflow

Phase 4: Frontend & Polish (Weeks 13-16)

Build React UI components
Add Terraform workspace detail cards
Implement admin dashboard
Write end-to-end tests

Phase 5: Production Readiness (Weeks 17-20)

Load testing (10K concurrent users)
Security audit (SOC 2 prep)
Performance optimization
Documentation & runbooks

15. Conclusion

This architecture provides an enterprise-grade foundation for a multi-tenant Backstage plugin that integrates with Terraform Cloud at scale. Key design principles:

Security First: Zero trust, encryption everywhere, tenant isolation
Scalability: Horizontal scaling, queue-based architecture, database partitioning
Real-Time: Webhook-driven updates, sub-5-minute latency
Cost-Effective: Shared infrastructure, $40/client at 100 clients
Compliance: SOC 2 Type II ready, comprehensive audit logging

Success Metrics:

100+ enterprise clients supported
99.9% uptime SLA
< 5 minute catalog sync latency
< 200ms API response time (p95)
Zero cross-tenant data leaks

Next Steps:

Review with stakeholders (security, infrastructure, product)
Finalize technology stack and vendor selection
Begin Phase 1 implementation (foundation)
Set up continuous integration and deployment pipelines

Appendix A: Glossary

RLS: Row-Level Security (PostgreSQL feature)
CMEK: Customer-Managed Encryption Keys
HPA: Horizontal Pod Autoscaler (Kubernetes)
TFC: Terraform Cloud
RBAC: Role-Based Access Control
SOC 2: Service Organization Control 2 (compliance standard)

Document Control

Version	Date	Author	Changes
1.0	2024-11-13	System Architect Agent	Initial architecture design
