
Enterprise Multi-Tenant Backstage Plugin Architecture

Terraform Cloud Integration for SaaS Platform

Document Version: 1.0
Date: November 13, 2024
Status: Architecture Design - Enterprise Edition
Classification: Internal Technical Design


Executive Summary

This document defines the enterprise-grade architecture for a multi-tenant Backstage plugin that integrates with Terraform Cloud to provide infrastructure catalog management across hundreds of client organizations. The design prioritizes security, scalability, tenant isolation, and real-time synchronization while maintaining a shared SaaS codebase.

Key Metrics & Targets

Metric | Target | Scale Factor
Clients Supported | 100+ enterprises | 10x current
Workspaces per Client | 50-500 | Variable
Total Entities | 100,000+ | 10x current
API Rate Limit | 30 req/sec (Terraform Cloud) | Shared resource
Sync Latency | < 5 minutes (real-time) | Event-driven
Data Isolation | 100% (zero cross-tenant leaks) | Critical
Uptime SLA | 99.9% | Enterprise-grade

1. System Context & Requirements

1.1 Business Context

Problem Statement:

  • Multiple enterprise clients need infrastructure visibility in Backstage
  • Each client has 10-100 business units with independent infrastructure repositories
  • Business units are dynamically created (M&A, reorganization, new initiatives)
  • Current manual catalog maintenance doesn't scale beyond 50 repositories
  • Security teams require that sensitive data never appear in Backstage

Solution Vision: A multi-tenant SaaS Backstage plugin that:

  1. Automatically discovers infrastructure repositories across GitHub organizations
  2. Pulls Terraform state from Terraform Cloud workspaces
  3. Sanitizes sensitive data before catalog ingestion
  4. Maintains strict tenant isolation in shared database
  5. Updates catalog in near-real-time (< 5 minutes)
  6. Scales to 100+ clients with 1000+ repositories each

1.2 Critical Requirements

Functional Requirements

  1. Terraform Cloud Integration

    • Authenticate via organization tokens, team tokens, or user tokens
    • Discover all workspaces across multiple organizations
    • Download latest state versions via API
    • Subscribe to workspace run webhooks for real-time updates
    • Handle pagination (100+ workspace pages)
  2. Multi-Tenant Data Isolation

    • Client data NEVER visible to other clients
    • Separate authentication per client (API keys, JWT tokens)
    • Audit logging of all cross-tenant access attempts
    • Configurable tenant-specific sanitization rules
  3. State Sanitization

    • Detect and redact PII (emails, names, addresses)
    • Strip credentials (passwords, API keys, tokens, certificates)
    • Remove service account keys and private keys
    • Configurable allowlists per client
    • Audit trail of all redactions
  4. Dynamic Discovery

    • Auto-detect new business unit repositories (GitHub org scanning)
    • Extract metadata from catalog-info.yaml
    • Link repositories to Terraform Cloud workspaces
    • Detect deleted/archived repositories
  5. Scalability

    • Process 1000+ workspace updates concurrently
    • Queue-based architecture for burst handling
    • Database partitioning for 100+ clients
    • CDN caching for static catalog data

Non-Functional Requirements

  1. Security

    • SOC 2 Type II compliance
    • Encryption at rest (AES-256) and in transit (TLS 1.3)
    • Zero trust network architecture
    • Role-based access control (RBAC) per tenant
  2. Performance

    • < 200ms API response time (p95)
    • < 5 minute sync latency for state updates
    • Support 10,000 concurrent users
    • Database queries < 100ms (p95)
  3. Reliability

    • 99.9% uptime SLA
    • Automatic failover for database and queue
    • Graceful degradation under load
    • Circuit breakers for external API calls
  4. Observability

    • Structured logging with trace IDs
    • Metrics dashboards per tenant
    • Alerting for anomalies
    • Cost tracking per tenant

2. Terraform Cloud Integration Architecture

2.1 API Authentication Strategy

Authentication Hierarchy

Token Type | Scope | Use Case | Rotation
Organization Token | Org-wide read/write | Initial setup, workspace creation | Manual (yearly)
Team Token | Team-scoped read | Read workspace states for specific teams | Automatic (90 days)
User Token | User-scoped read | Development/testing only | Automatic (30 days)

Implementation Details:

// Token hierarchy with automatic fallback
interface TerraformCloudAuth {
  tenantId: string;
  primaryToken: string;      // Organization token
  fallbackTokens: string[];  // Team tokens for redundancy
  rotationPolicy: {
    enabled: boolean;
    intervalDays: number;
    alertBeforeDays: number;
  };
}

// Token storage in Google Secret Manager
const secretPath = `projects/${PROJECT_ID}/secrets/tfc-token-${tenantId}/versions/latest`;
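
The rotationPolicy fields could be evaluated roughly as follows. This is a minimal sketch, not the plugin's actual scheduler: the RotationStatus type and the rotationStatus function are hypothetical names, and it assumes the token's issue date is tracked somewhere alongside the secret.

```typescript
interface RotationPolicy {
  enabled: boolean;
  intervalDays: number;
  alertBeforeDays: number;
}

// Hypothetical status values for illustration only.
type RotationStatus = "disabled" | "ok" | "alert" | "overdue";

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Classifies a token by how close it is to the end of its rotation interval.
function rotationStatus(policy: RotationPolicy, issuedAt: Date, now: Date): RotationStatus {
  if (!policy.enabled) return "disabled";
  const ageDays = (now.getTime() - issuedAt.getTime()) / MS_PER_DAY;
  const remainingDays = policy.intervalDays - ageDays;
  if (remainingDays <= 0) return "overdue";
  if (remainingDays <= policy.alertBeforeDays) return "alert";
  return "ok";
}

// Example: a team token on a 90-day interval with a 14-day alert window.
const teamPolicy: RotationPolicy = { enabled: true, intervalDays: 90, alertBeforeDays: 14 };
const issued = new Date("2024-01-01T00:00:00Z");
const early = rotationStatus(teamPolicy, issued, new Date("2024-02-01T00:00:00Z"));
const nearExpiry = rotationStatus(teamPolicy, issued, new Date("2024-03-20T00:00:00Z"));
const late = rotationStatus(teamPolicy, issued, new Date("2024-04-15T00:00:00Z"));
```

A status of "alert" would be the point at which to notify the tenant's operators; "overdue" tokens are candidates for falling back to the next entry in fallbackTokens.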

2.2 Workspace Discovery API

Flow:

  1. Organization Listing: GET /api/v2/organizations
  2. Workspace Pagination: GET /api/v2/organizations/{org}/workspaces?page[size]=100&page[number]=1
  3. State Version Retrieval: GET /api/v2/workspaces/{workspace_id}/current-state-version
  4. State Download: GET the URL in the state version's hosted-state-download-url attribute (JSON:API attribute names are hyphenated)

Rate Limiting Strategy:

  • Terraform Cloud: 30 requests/second (shared across all users)
  • Plugin: Implements token bucket algorithm
  • Per-tenant quota: 5 requests/second
  • Burst allowance: 50 requests

// Rate limiter implementation
class TerraformCloudRateLimiter {
  private buckets: Map<string, TokenBucket> = new Map();

  async checkLimit(tenantId: string): Promise<boolean> {
    let bucket = this.buckets.get(tenantId);
    if (!bucket) {
      // Create the bucket on first use and keep the reference;
      // otherwise the first call for a tenant would dereference undefined
      bucket = new TokenBucket(5, 5); // 5 req/sec
      this.buckets.set(tenantId, bucket);
    }
    return bucket.tryConsume(1);
  }

  async waitForCapacity(tenantId: string): Promise<void> {
    while (!(await this.checkLimit(tenantId))) {
      await sleep(200); // Wait 200ms before retry
    }
  }
}
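
The rate limiter relies on a TokenBucket type that is not shown. One possible shape is sketched below, assuming the two constructor arguments are the refill rate per second and the bucket capacity; that argument order, and the injectable clock used to make the behaviour testable, are assumptions rather than something the design specifies.

```typescript
// Minimal token bucket: tokens refill continuously at `ratePerSec`, capped at `capacity`.
class TokenBucket {
  private readonly ratePerSec: number;
  private readonly capacity: number;
  private readonly now: () => number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(ratePerSec: number, capacity: number, now: () => number = () => Date.now()) {
    this.ratePerSec = ratePerSec;
    this.capacity = capacity;
    this.now = now;
    this.tokens = capacity; // start full so an initial burst up to `capacity` is allowed
    this.lastRefillMs = now();
  }

  // Takes `count` tokens if available; returns false without consuming otherwise.
  tryConsume(count: number): boolean {
    this.refill();
    if (this.tokens < count) return false;
    this.tokens -= count;
    return true;
  }

  private refill(): void {
    const nowMs = this.now();
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefillMs = nowMs;
  }
}

// Usage with a fake clock: 5 req/sec, capacity 5.
let fakeNowMs = 0;
const bucket = new TokenBucket(5, 5, () => fakeNowMs);
const firstFive = [1, 2, 3, 4, 5].map(() => bucket.tryConsume(1));
const sixth = bucket.tryConsume(1);     // bucket is empty at this point
fakeNowMs += 1000;                      // one second later the bucket has refilled
const afterRefill = bucket.tryConsume(1);
```

Keeping capacity separate from the refill rate is what lets a per-tenant quota and a larger burst allowance be configured independently.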

2.3 Webhook Event Processing

Real-Time Updates via Terraform Cloud Webhooks:

# Webhook Configuration
webhooks:
  - name: "backstage-catalog-sync"
    enabled: true
    url: "https://plugin.backstage.example.com/api/webhooks/terraform-cloud"
    token: "<webhook-secret-token>"
    events:
      - "run:completed"
      - "run:errored"
      - "workspace:created"
      - "workspace:deleted"

Event Processing Flow:

Webhook Security:

  • HMAC-SHA512 signature verification (the digest Terraform Cloud sends in the X-TFE-Notification-Signature header)
  • IP allowlist for Terraform Cloud endpoints
  • Replay attack prevention (timestamp + nonce)
  • Rate limiting per tenant (1000 events/hour)
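
Signature verification from the list above might look like the sketch below, using Node's built-in crypto module. The function name and the hex encoding of the header value are assumptions. The digest algorithm is a parameter because it has to match what the sender uses; Terraform Cloud's notification documentation describes an HMAC-SHA512 digest, which is the default here and worth confirming against the current docs.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Recomputes the HMAC over the raw request body and compares it to the
// received signature in constant time.
function verifyWebhookSignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
  algorithm: string = "sha512",
): boolean {
  const expected = createHmac(algorithm, secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so reject differing lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}

// Example with a made-up secret and payload.
const webhookSecret = "example-webhook-secret";
const payload = JSON.stringify({ run_id: "run-123", workspace_id: "ws-abc123def456" });
const validSignature = createHmac("sha512", webhookSecret).update(payload, "utf8").digest("hex");

const accepted = verifyWebhookSignature(payload, validSignature, webhookSecret);
const tampered = verifyWebhookSignature(payload + " ", validSignature, webhookSecret);
```

The comparison has to run over the exact bytes received, so the handler should capture the raw body before any JSON parsing middleware rewrites it.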

3. Multi-Tenant Data Architecture

3.1 Tenant Isolation Strategy

Option Analysis:

Strategy | Pros | Cons | Recommendation
Separate Databases | Maximum isolation, simple RBAC | High cost, complex backups | ❌ Not scalable
Separate Schemas | Good isolation, moderate cost | Schema migrations complex | ⚠️ Fallback option
Row-Level Security (RLS) | Cost-effective, single DB | Complex policies, performance overhead | Primary choice
Discriminator Column | Simple to implement | No DB-level isolation, risky | ❌ Insufficient security

Selected Approach: PostgreSQL Row-Level Security (RLS) + Tenant Column

-- Tenant isolation table
CREATE TABLE tenants (
  tenant_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name VARCHAR(255) NOT NULL,
  org_slug VARCHAR(100) UNIQUE NOT NULL,
  plan_tier VARCHAR(50) DEFAULT 'enterprise',
  status VARCHAR(20) NOT NULL DEFAULT 'active', -- Checked by the tenant context middleware
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

-- Catalog entities with tenant column
CREATE TABLE catalog_entities (
  entity_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
  entity_ref VARCHAR(500) NOT NULL,
  entity_kind VARCHAR(100) NOT NULL,
  entity_name VARCHAR(255) NOT NULL,
  metadata JSONB NOT NULL,
  spec JSONB NOT NULL,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW(),

  CONSTRAINT unique_entity_per_tenant UNIQUE (tenant_id, entity_ref)
);

-- Enable Row-Level Security
ALTER TABLE catalog_entities ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS by default; FORCE applies the policy to them as well
ALTER TABLE catalog_entities FORCE ROW LEVEL SECURITY;

-- Policy: Users can only see their tenant's data
CREATE POLICY tenant_isolation_policy ON catalog_entities
  USING (tenant_id = current_setting('app.current_tenant')::UUID);

-- Index for performance
CREATE INDEX idx_entities_tenant_id ON catalog_entities(tenant_id);
CREATE INDEX idx_entities_kind_name ON catalog_entities(tenant_id, entity_kind, entity_name);

Tenant Context Injection:

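// Middleware to set tenant context
app.use(async (req, res, next) => {
  try {
    const tenantId = await extractTenantId(req); // From JWT, API key, or header

    if (!tenantId) {
      return res.status(401).json({ error: 'Missing tenant context' });
    }

    // Validate tenant exists and is active
    // (assumes a node-postgres style result object with a `rows` array)
    const tenant = await db.query(
      'SELECT tenant_id FROM tenants WHERE tenant_id = $1 AND status = $2',
      [tenantId, 'active']
    );

    if (tenant.rows.length === 0) {
      return res.status(403).json({ error: 'Invalid or inactive tenant' });
    }

    // Set PostgreSQL session variable for RLS. SET does not accept bind
    // parameters, so use set_config(); passing true makes the value
    // transaction-local like SET LOCAL, so it must run on the same connection
    // and inside the same transaction as the request's queries.
    await db.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);

    req.tenantId = tenantId;
    next();
  } catch (err) {
    next(err); // Forward async failures to the Express error handler
  }
});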
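
Because a transaction-local setting only lasts until the current transaction ends, the tenant context and the tenant's queries need to share one connection and one transaction. The sketch below shows one way to wrap that, assuming a client with a promise-based query method (a node-postgres client has a compatible one); QueryClient and withTenantContext are hypothetical names for illustration.

```typescript
// Smallest surface this sketch needs from a database client (hypothetical interface).
interface QueryClient {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

// Runs `work` inside a transaction whose RLS tenant is set to `tenantId`.
async function withTenantContext<T>(
  client: QueryClient,
  tenantId: string,
  work: (client: QueryClient) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    // set_config(name, value, true) is the parameterizable form of SET LOCAL.
    await client.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}

// Usage with a recording fake client, so no database is needed to see the order.
const executed: string[] = [];
const fakeClient: QueryClient = {
  query(sql: string): Promise<unknown> {
    executed.push(sql);
    return Promise.resolve({ rows: [] });
  },
};

const demo = withTenantContext(fakeClient, "3f2b6c1e-0000-4000-8000-000000000001", async (c) => {
  await c.query("SELECT * FROM catalog_entities");
  return "done";
});
```

With a connection pool, the client passed in should be a single checked-out connection rather than the pool itself, since a pool may run each statement on a different connection.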

3.2 Tenant Identification Methods

Priority Order:

  1. JWT Token (Production): Tenant ID embedded in signed JWT claims
  2. API Key (Service-to-Service): Hashed API key mapped to tenant
  3. Header Override (Development): X-Tenant-ID header (disabled in prod)

// JWT Token Structure
interface BackstageTenantJWT {
  sub: string;            // User ID
  tenant_id: string;      // Primary tenant identifier
  tenant_slug: string;    // Human-readable tenant name
  permissions: string[];  // Backstage permissions
  iat: number;            // Issued at
  exp: number;            // Expiration (1 hour)
}

// API Key Structure (hashed in database)
interface APIKey {
  key_id: string;
  tenant_id: string;
  key_hash: string;   // bcrypt hash of API key
  scopes: string[];   // e.g., ['read:catalog', 'write:webhooks']
  expires_at: Date;
  last_used_at: Date;
}
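
The priority order above can be expressed as a small resolver. This is a sketch under stated assumptions: the JWT and API key are taken to be verified upstream, and TenantCredentials, TenantSource, and resolveTenantId are hypothetical names.

```typescript
interface TenantCredentials {
  jwtTenantId?: string;     // tenant_id claim from an already-verified JWT
  apiKeyTenantId?: string;  // tenant mapped from an already-validated API key
  headerTenantId?: string;  // raw X-Tenant-ID header value
}

type TenantSource = "jwt" | "api-key" | "header";

interface ResolvedTenant {
  tenantId: string;
  source: TenantSource;
}

// Applies the priority order: JWT, then API key, then the header override,
// which is honoured only outside production.
function resolveTenantId(creds: TenantCredentials, isProduction: boolean): ResolvedTenant | null {
  if (creds.jwtTenantId) return { tenantId: creds.jwtTenantId, source: "jwt" };
  if (creds.apiKeyTenantId) return { tenantId: creds.apiKeyTenantId, source: "api-key" };
  if (!isProduction && creds.headerTenantId) {
    return { tenantId: creds.headerTenantId, source: "header" };
  }
  return null;
}

// Examples
const fromJwt = resolveTenantId({ jwtTenantId: "tenant-a", headerTenantId: "tenant-b" }, true);
const headerInProd = resolveTenantId({ headerTenantId: "tenant-b" }, true);
const headerInDev = resolveTenantId({ headerTenantId: "tenant-b" }, false);
```

Returning the source alongside the tenant ID makes it straightforward to audit how each request was attributed to a tenant.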

3.3 Entity Naming Conventions

Cross-Tenant Uniqueness:

# Standard entity ref format
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  # Client-prefixed names prevent collisions
  name: acme-corp-payment-service
  namespace: acme-corp # Tenant slug as namespace
  annotations:
    backstage.io/techdocs-ref: dir:.
    backstage.io/source-location: url:https://github.com/acme-corp/payment-service
    terraform.io/workspace-id: ws-abc123def456
    terraform.io/organization: acme-corp-prod
spec:
  type: service
  lifecycle: production
  owner: acme-corp-platform-team
  system: acme-corp-payments

Namespace Hierarchy:

{tenant-slug}                          # Root namespace (e.g., acme-corp)
├── {tenant-slug}-infrastructure       # Infrastructure components
├── {tenant-slug}-platform             # Platform services
└── {tenant-slug}-{business-unit}      # Business unit namespaces
    ├── {bu}-development
    ├── {bu}-staging
    └── {bu}-production

4. State Sanitization Pipeline Architecture

4.1 Sanitization Workflow

4.2 Sensitive Data Detection Rules

Rule Categories:

Category | Detection Method | Example Patterns | Action
PII | Regex + ML | Email, SSN, phone numbers | Redact
Credentials | Keyword + entropy | API keys, passwords, tokens | Redact
Private Keys | Format detection | RSA/EC keys, certificates | Redact
Cloud Secrets | Provider-specific | GCP service account keys, AWS access keys | Redact
Database Credentials | Connection strings | postgres://user:pass@host | Redact
IP Addresses | Regex (private ranges) | 10.x.x.x, 192.168.x.x (configurable) | Redact or Allow

Detection Engine:

interface SanitizationRule {
  id: string;
  name: string;
  category: 'pii' | 'credential' | 'infrastructure';
  enabled: boolean;
  pattern: RegExp | string;
  action: 'redact' | 'hash' | 'allow';
  priority: number; // Higher priority rules run first
}

class StateSanitizer {
  private tenantId: string;
  private auditLog: AuditLog; // Audit sink, injected (type assumed)
  private rules: SanitizationRule[];
  private allowlists: Map<string, Set<string>>; // Per-tenant allowlists

  constructor(tenantId: string, auditLog: AuditLog) {
    this.tenantId = tenantId; // Needed later for the audit record
    this.auditLog = auditLog;
    this.rules = this.loadRules(tenantId);
    this.allowlists = this.loadAllowlists(tenantId);
  }

  async sanitize(state: TerraformState): Promise<SanitizedState> {
    const violations: Violation[] = [];
    const redactedState = cloneDeep(state);

    // Traverse JSON recursively
    this.traverseAndSanitize(redactedState.resources, violations);

    // Log all redactions for audit
    await this.auditLog.record({
      tenantId: this.tenantId,
      stateId: state.id,
      violations: violations,
      timestamp: new Date()
    });

    return {
      sanitized: redactedState,
      violations: violations,
      safe: violations.length === 0
    };
  }

  private traverseAndSanitize(obj: any, violations: Violation[], path: string = ''): void {
    // Terraform state contains null attribute values; typeof null is 'object'
    if (obj === null || typeof obj !== 'object') return;

    for (const [key, value] of Object.entries(obj)) {
      const currentPath = path ? `${path}.${key}` : key;

      // Check if key itself is sensitive (e.g., "password", "api_key")
      if (this.isSensitiveKey(key)) {
        violations.push({ path: currentPath, rule: 'sensitive-key', value: '***' });
        obj[key] = '[REDACTED]';
        continue;
      }

      // Check if value matches sensitive patterns
      if (typeof value === 'string') {
        const match = this.matchesRule(value);
        if (match && !this.isAllowlisted(currentPath, value)) {
          violations.push({ path: currentPath, rule: match.id, value: '***' });
          obj[key] = `[REDACTED:${match.category}]`;
        }
      } else if (value !== null && typeof value === 'object') {
        // Arrays are objects too, so list elements are traversed by index
        this.traverseAndSanitize(value, violations, currentPath);
      }
    }
  }
}

4.3 Configurable Sanitization Rules

Per-Tenant Configuration:

# Example: acme-corp sanitization config
tenant_id: acme-corp
sanitization:
  global_rules:
    - id: strip-passwords
      enabled: true
      pattern: "(password|passwd|pwd)\\s*[:=]\\s*['\"]?([^'\"\\s]+)"
      action: redact

    - id: strip-api-keys
      enabled: true
      pattern: "[a-zA-Z0-9]{32,}" # High-entropy strings
      min_entropy: 4.5
      action: redact

  allowlists:
    # Allow specific IP ranges
    - pattern: "10\\.128\\..*"
      reason: "Internal VPC CIDR"

    # Allow specific service account emails
    - pattern: "terraform@acme-corp\\.iam\\.gserviceaccount\\.com"
      reason: "Public service account for Terraform Cloud"

  custom_rules:
    - id: redact-internal-domains
      pattern: ".*\\.internal\\.acme\\.com"
      action: redact
      reason: "Internal domain names are confidential"
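
The min_entropy setting on strip-api-keys suggests a Shannon entropy check on top of the regex. A minimal sketch of that combination follows, assuming entropy is measured in bits per character over the candidate string; shannonEntropy and looksLikeSecret are hypothetical names.

```typescript
// Shannon entropy in bits per character: H = -sum(p_i * log2(p_i)).
function shannonEntropy(value: string): number {
  const chars = value.split("");
  if (chars.length === 0) return 0;
  const counts: Record<string, number> = {};
  for (const ch of chars) {
    counts[ch] = (counts[ch] || 0) + 1;
  }
  let entropy = 0;
  for (const ch of Object.keys(counts)) {
    const p = counts[ch] / chars.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Mirrors the strip-api-keys rule: 32+ alphanumeric characters AND entropy at or above the threshold.
function looksLikeSecret(value: string, minEntropy: number = 4.5): boolean {
  return /^[a-zA-Z0-9]{32,}$/.test(value) && shannonEntropy(value) >= minEntropy;
}

// A long but repetitive string matches the regex yet has zero entropy.
const repetitive = looksLikeSecret("a".repeat(40));
// 32 distinct characters give log2(32) = 5 bits per character.
const randomLooking = looksLikeSecret("abcdefghijklmnopqrstuvwxyzABCDEF");
```

The entropy gate is what keeps the regex from redacting long but low-information values such as padded identifiers; the threshold is a tuning parameter that each tenant may need to adjust.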

4.4 Audit Logging

Audit Log Schema:

CREATE TABLE sanitization_audit_log (
  log_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
  workspace_id VARCHAR(255) NOT NULL,
  state_version VARCHAR(100) NOT NULL,
  violations JSONB NOT NULL, -- Array of violation objects
  sanitized_at TIMESTAMP DEFAULT NOW(),

  -- For compliance reporting
  pii_count INTEGER DEFAULT 0,
  credential_count INTEGER DEFAULT 0,
  total_redactions INTEGER DEFAULT 0
);

-- PostgreSQL has no inline INDEX clause in CREATE TABLE, and index names must
-- be unique per schema, so the indexes are created separately with table-specific names
CREATE INDEX idx_sanitization_audit_tenant_time ON sanitization_audit_log (tenant_id, sanitized_at);
CREATE INDEX idx_sanitization_audit_workspace ON sanitization_audit_log (workspace_id);

-- Example query: Violations by tenant over last 30 days
SELECT
  tenant_id,
  COUNT(*) AS total_sanitizations,
  SUM(pii_count) AS total_pii_redactions,
  SUM(credential_count) AS total_credential_redactions
FROM sanitization_audit_log
WHERE sanitized_at > NOW() - INTERVAL '30 days'
GROUP BY tenant_id;

5. Scalability Architecture

5.1 System Components

5.2 Horizontal Scaling Strategy

Auto-Scaling Policies:

Component | Metric | Scale Up Threshold | Scale Down Threshold | Min/Max Replicas
Backend API | CPU Utilization | > 70% | < 30% | 3 / 50
Catalog Processor | Queue Depth | > 1000 messages | < 100 messages | 5 / 100
Webhook Handler | Request Rate | > 500 req/sec | < 100 req/sec | 2 / 20

Kubernetes HorizontalPodAutoscaler:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalog-processor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalog-processor
  minReplicas: 5
  maxReplicas: 100
  metrics:
    - type: External
      external:
        metric:
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              resource.labels.subscription_id: terraform-state-updates
        target:
          type: AverageValue
          averageValue: "1000"

    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
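
To see what the AverageValue target of 1000 means in practice: for an external metric, the autoscaler divides the total metric value by the per-pod target and rounds up, then clamps to the min/max bounds. The sketch below reproduces only that arithmetic for the queue metric; desiredReplicas is a hypothetical helper, and the real controller also applies tolerances and stabilization windows and takes the highest proposal across all configured metrics.

```typescript
// Replica count proposed by an AverageValue target on a queue backlog.
function desiredReplicas(
  totalBacklog: number,
  targetPerPod: number,
  minReplicas: number,
  maxReplicas: number,
): number {
  const proposed = Math.ceil(totalBacklog / targetPerPod);
  return Math.min(maxReplicas, Math.max(minReplicas, proposed));
}

// With the HPA above: target 1000 messages per pod, 5 to 100 replicas.
const quiet = desiredReplicas(500, 1000, 5, 100);        // held at the minimum
const busy = desiredReplicas(12000, 1000, 5, 100);       // 12000 / 1000
const flooded = desiredReplicas(250000, 1000, 5, 100);   // capped at the maximum
```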

5.3 Queue Architecture

Message Queue Design (Google Cloud Pub/Sub):

topics:
  - name: terraform-workspace-discovered
    description: "New workspace discovered via GitHub scan or TFC webhook"
    message_retention: 7 days
    subscriptions:
      - name: catalog-processor-subscription
        ack_deadline: 600s # 10 minutes for state processing
        retry_policy:
          minimum_backoff: 10s
          maximum_backoff: 600s

  - name: terraform-state-updated
    description: "State version updated in Terraform Cloud"
    message_retention: 7 days
    subscriptions:
      - name: state-sync-subscription
        ack_deadline: 300s
        retry_policy:
          minimum_backoff: 5s
          maximum_backoff: 300s

  - name: sanitization-failed
    description: "Dead letter queue for sanitization failures"
    message_retention: 30 days
    subscriptions:
      - name: manual-review-subscription
        ack_deadline: 3600s # 1 hour for manual review

Message Schema:

interface WorkspaceDiscoveredMessage {
  tenant_id: string;
  workspace_id: string;
  workspace_name: string;
  organization: string;
  repository_url: string;
  discovered_at: string;
  discovery_method: 'github-scan' | 'terraform-webhook' | 'manual';
}

interface StateUpdatedMessage {
  tenant_id: string;
  workspace_id: string;
  state_version: string;
  run_id: string;
  updated_at: string;
  priority: 'high' | 'normal' | 'low';
}

5.4 Database Optimization

Partitioning Strategy:

-- Partition catalog_entities by tenant_id (hash partitioning)
CREATE TABLE catalog_entities (
  entity_id UUID DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  entity_ref VARCHAR(500) NOT NULL,
  metadata JSONB NOT NULL,
  created_at TIMESTAMP DEFAULT NOW(),
  ...
) PARTITION BY HASH (tenant_id);

-- Create 10 partitions (adjust based on tenant count)
CREATE TABLE catalog_entities_p0 PARTITION OF catalog_entities
  FOR VALUES WITH (MODULUS 10, REMAINDER 0);
CREATE TABLE catalog_entities_p1 PARTITION OF catalog_entities
  FOR VALUES WITH (MODULUS 10, REMAINDER 1);
-- ... create p2 through p9

-- Index on each partition
CREATE INDEX idx_entities_p0_ref ON catalog_entities_p0(entity_ref);
CREATE INDEX idx_entities_p1_ref ON catalog_entities_p1(entity_ref);
-- ... create indexes on p2 through p9

Caching Strategy (Redis):

// Cache layers
const CACHE_TTL = {
  CATALOG_ENTITY: 300,      // 5 minutes
  WORKSPACE_METADATA: 600,  // 10 minutes
  TENANT_CONFIG: 3600,      // 1 hour
  STATE_CHECKSUM: 86400,    // 24 hours (for change detection)
};

// Cache key patterns
const cacheKeys = {
  entity: (tenantId: string, entityRef: string) =>
    `entity:${tenantId}:${entityRef}`,

  workspaceList: (tenantId: string) =>
    `workspaces:${tenantId}:list`,

  tenantConfig: (tenantId: string) =>
    `tenant:${tenantId}:config`,
};

// Cache-aside pattern
async function getCatalogEntity(tenantId: string, entityRef: string): Promise<Entity | null> {
  const cacheKey = cacheKeys.entity(tenantId, entityRef);

  // Try cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }

  // Cache miss: fetch from database
  // (assumes a node-postgres style result object with a `rows` array)
  const result = await db.query(
    'SELECT * FROM catalog_entities WHERE tenant_id = $1 AND entity_ref = $2',
    [tenantId, entityRef]
  );
  const entity = result.rows[0] ?? null;

  // Store in cache only when the entity exists, so a miss is not pinned for the TTL
  if (entity) {
    await redis.setex(cacheKey, CACHE_TTL.CATALOG_ENTITY, JSON.stringify(entity));
  }

  return entity;
}
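
The STATE_CHECKSUM entry is described as being for change detection. One way that might work is sketched below with an in-memory map standing in for Redis; the key format, ChecksumTracker, and stateChecksum are assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// SHA-256 over the serialized state; identical bytes give identical checksums.
function stateChecksum(serializedState: string): string {
  return createHash("sha256").update(serializedState, "utf8").digest("hex");
}

// In-memory stand-in for the Redis STATE_CHECKSUM entries (hypothetical helper).
class ChecksumTracker {
  private readonly lastSeen = new Map<string, string>();

  // Returns true when the state differs from the last one recorded for this workspace.
  hasChanged(tenantId: string, workspaceId: string, serializedState: string): boolean {
    const key = `checksum:${tenantId}:${workspaceId}`;
    const checksum = stateChecksum(serializedState);
    if (this.lastSeen.get(key) === checksum) return false;
    this.lastSeen.set(key, checksum);
    return true;
  }
}

// Example: the second sync of an identical state can be skipped.
const tracker = new ChecksumTracker();
const firstSeen = tracker.hasChanged("acme-corp", "ws-abc123def456", '{"serial":1}');
const unchanged = tracker.hasChanged("acme-corp", "ws-abc123def456", '{"serial":1}');
const changed = tracker.hasChanged("acme-corp", "ws-abc123def456", '{"serial":2}');
```

Hashing the state exactly as downloaded avoids false positives from JSON key reordering, and including the tenant ID in the key keeps checksums from different tenants apart.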

5.5 Performance Projections

Capacity Planning (100 Clients):

Resource | Per Client | 100 Clients | Notes
Workspaces | 200 | 20,000 | Average across clients
Entities | 1,000 | 100,000 | Components, systems, APIs
State Syncs/Day | 500 | 50,000 | ~0.6 syncs/second average
API Requests/Day | 10,000 | 1,000,000 | ~12 req/second average
Database Size | 50 MB | 5 GB | JSONB compressed
Queue Messages/Day | 1,000 | 100,000 | Burst capacity: 1000 msg/sec

Cost Estimates (GCP us-central1):

Service | Specification | Monthly Cost
GKE Cluster | 10x n2-standard-8 nodes | $2,500
Cloud SQL (PostgreSQL) | db-custom-8-32GB (HA) | $800
Cloud Pub/Sub | 100M messages/month | $400
Cloud Storage | 100 GB (state backups) | $2
Secret Manager | 1000 secrets x 10k accesses | $30
Load Balancer | 10 TB ingress/egress | $300
Total | | ~$4,000/month

Cost per Client: ~$40/month (at 100 clients)
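
The averages and the per-client cost follow directly from the table figures. The short calculation below reproduces them as a check; the variable names are illustrative only.

```typescript
const SECONDS_PER_DAY = 86400;

// Converts a daily volume into an average per-second rate.
function perSecond(perDay: number): number {
  return perDay / SECONDS_PER_DAY;
}

const syncsPerSecond = perSecond(50000);      // state syncs across 100 clients
const apiPerSecond = perSecond(1000000);      // API requests across 100 clients

// Monthly line items from the cost table, in USD.
const monthlyCosts = [2500, 800, 400, 2, 30, 300];
const totalMonthly = monthlyCosts.reduce((sum, cost) => sum + cost, 0);
const costPerClient = totalMonthly / 100;

console.log(syncsPerSecond.toFixed(2)); // 0.58
console.log(apiPerSecond.toFixed(2));   // 11.57
console.log(totalMonthly);              // 4032
console.log(costPerClient.toFixed(2));  // 40.32
```

These are daily averages; peak load after a large batch of Terraform runs can be far higher, which is why the queue and burst allowance exist.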


6. Dynamic Discovery & Onboarding

6.1 GitHub Repository Scanning

Discovery Flow:

Implementation:

// GitHub scanner service
class GitHubRepoScanner {
  private octokit: Octokit;
  private tenantId: string;

  async scanOrganization(orgName: string): Promise<DiscoveryResult[]> {
    const repos = await this.octokit.paginate(
      'GET /orgs/{org}/repos',
      { org: orgName, per_page: 100 }
    );

    const results: DiscoveryResult[] = [];

    for (const repo of repos) {
      try {
        // Fetch catalog-info.yaml from default branch
        const catalogInfo = await this.fetchCatalogInfo(repo);

        if (catalogInfo) {
          const workspaceId = this.extractWorkspaceId(catalogInfo);

          results.push({
            repository: repo.full_name,
            workspace_id: workspaceId,
            entity_ref: catalogInfo.metadata.name,
            discovered_at: new Date()
          });

          // Enqueue for processing
          await this.enqueueDiscovery(catalogInfo, workspaceId);
        }
      } catch (error) {
        // One failing repository must not abort the whole scan
        console.error(`Failed to scan ${repo.full_name}:`, error);
      }
    }

    return results;
  }

  private async fetchCatalogInfo(repo: Repository): Promise<CatalogInfo | null> {
    try {
      const response = await this.octokit.rest.repos.getContent({
        owner: repo.owner.login,
        repo: repo.name,
        path: 'catalog-info.yaml',
        ref: repo.default_branch
      });

      if ('content' in response.data) {
        const content = Buffer.from(response.data.content, 'base64').toString();
        return yaml.parse(content);
      }
      return null; // Path exists but is not a regular file (e.g., a directory)
    } catch (error) {
      if ((error as { status?: number }).status === 404) {
        return null; // No catalog-info.yaml
      }
      throw error;
    }
  }

  private extractWorkspaceId(catalogInfo: CatalogInfo): string | null {
    // A missing annotation yields undefined; normalize to null to match the return type
    return catalogInfo.metadata.annotations?.['terraform.io/workspace-id'] ?? null;
  }
}

6.2 Terraform Cloud Workspace Enumeration

Bulk Workspace Discovery:

// Terraform Cloud workspace scanner
class TerraformCloudScanner {
  private client: TerraformCloudClient;
  private rateLimiter: TerraformCloudRateLimiter;
  private tenantId: string;

  async enumerateWorkspaces(orgName: string): Promise<Workspace[]> {
    const workspaces: Workspace[] = [];
    let page = 1;
    let hasMore = true;

    while (hasMore) {
      // Rate limit protection: wait before every request, including the first
      await this.rateLimiter.waitForCapacity(this.tenantId);

      const response = await this.client.get(
        `/organizations/${orgName}/workspaces`,
        {
          params: {
            'page[size]': 100,
            'page[number]': page
          }
        }
      );

      workspaces.push(...response.data);

      // Pagination metadata uses hyphenated JSON:API keys ("next-page")
      hasMore = response.meta.pagination['next-page'] !== null;
      page++;
    }

    return workspaces;
  }

  async linkToRepository(workspace: Workspace): Promise<string | null> {
    // Extract repository from workspace VCS settings
    // (assumes the VCS provider is GitHub, as elsewhere in this design)
    const vcsRepo = workspace.attributes['vcs-repo'];
    if (vcsRepo) {
      return `https://github.com/${vcsRepo.identifier}`;
    }
    return null;
  }
}

6.3 Automated Onboarding Workflow

New Business Unit Setup (< 5 minutes):

  1. Repository Creation (Manual): DevOps team creates bu-{name}-infra repo in GitHub
  2. catalog-info.yaml Template: CI/CD adds template during repo initialization
  3. Terraform Cloud Workspace: Created automatically via TFC API
  4. First Scan (Automated): Scheduled scanner picks up new repo within 5 minutes
  5. State Sync (Automated): Catalog processor fetches state and creates entity
  6. Backstage Visibility: Entity appears in catalog within 30 seconds

Onboarding Template (catalog-info.yaml):

apiVersion: backstage.io/v1alpha1
kind: System
metadata:
  name: {{ tenant_slug }}-{{ business_unit }}-infrastructure
  namespace: {{ tenant_slug }}
  description: Infrastructure for {{ business_unit }} business unit
  annotations:
    # Automatically linked by discovery
    terraform.io/workspace-id: "{{ workspace_id }}"
    terraform.io/organization: "{{ tenant_slug }}-{{ environment }}"
    github.com/project-slug: "{{ org }}/{{ repo }}"
    backstage.io/techdocs-ref: dir:.
spec:
  owner: {{ tenant_slug }}-{{ business_unit }}-team
  domain: infrastructure
  type: infrastructure

7. Plugin Architecture Components

7.1 Backstage Plugin Structure

backstage-plugin-terraform-cloud/
├── backend/ # Backend plugin
│ ├── src/
│ │ ├── service/
│ │ │ ├── router.ts # API routes
│ │ │ ├── TerraformCloudClient.ts
│ │ │ ├── StateSanitizer.ts
│ │ │ └── CatalogSync.ts
│ │ ├── processors/
│ │ │ └── TerraformCloudEntityProcessor.ts
│ │ ├── providers/
│ │ │ └── TerraformCloudEntityProvider.ts
│ │ └── plugin.ts
│ └── package.json

├── frontend/ # Frontend plugin
│ ├── src/
│ │ ├── components/
│ │ │ ├── TerraformWorkspaceCard/
│ │ │ ├── StateResourcesTable/
│ │ │ └── WorkspaceRunsTimeline/
│ │ ├── routes.ts
│ │ └── plugin.ts
│ └── package.json

├── common/ # Shared types
│ ├── src/
│ │ ├── types.ts
│ │ └── api.ts
│ └── package.json

└── docs/
    ├── setup.md
    └── configuration.md

7.2 Backend Plugin (Catalog Processor)

Core Responsibilities:

  • Fetch Terraform Cloud state
  • Sanitize sensitive data
  • Transform state to Backstage entities
  • Handle webhook events

Implementation:

// backend/src/processors/TerraformCloudEntityProcessor.ts
import {
  CatalogProcessor,
  CatalogProcessorEmit,
  processingResult,
} from '@backstage/plugin-catalog-node';
import { LocationSpec } from '@backstage/plugin-catalog-common';
import { Entity, stringifyEntityRef } from '@backstage/catalog-model';
import { TerraformCloudClient } from '../service/TerraformCloudClient';
import { StateSanitizer } from '../service/StateSanitizer';

export class TerraformCloudEntityProcessor implements CatalogProcessor {
  private client: TerraformCloudClient;
  private sanitizer: StateSanitizer;

  constructor(
    client: TerraformCloudClient,
    sanitizer: StateSanitizer
  ) {
    this.client = client;
    this.sanitizer = sanitizer;
  }

  getProcessorName(): string {
    return 'TerraformCloudEntityProcessor';
  }

  async postProcessEntity(
    entity: Entity,
    location: LocationSpec,
    emit: CatalogProcessorEmit
  ): Promise<Entity> {
    // Check if entity has Terraform Cloud annotations
    const workspaceId = entity.metadata.annotations?.['terraform.io/workspace-id'];

    if (!workspaceId) {
      return entity; // Not a Terraform-managed entity
    }

    try {
      // Fetch latest state from Terraform Cloud
      const state = await this.client.fetchWorkspaceState(workspaceId);

      // Sanitize state; sanitize() returns { sanitized, violations, safe },
      // so the redacted state itself is under `sanitized`
      const { sanitized } = await this.sanitizer.sanitize(state);

      // Extract resources and create child entities
      for (const resource of sanitized.resources) {
        const childEntity = this.createResourceEntity(entity, resource);
        // Emitted entities must carry the location they originate from
        emit(processingResult.entity(location, childEntity));
      }

      // Enrich parent entity with state metadata
      entity.metadata.annotations = {
        ...entity.metadata.annotations,
        'terraform.io/state-version': String(state.version),
        'terraform.io/last-updated': state.updated_at,
        'terraform.io/resource-count': sanitized.resources.length.toString()
      };

      return entity;
    } catch (error) {
      console.error(`Failed to process Terraform entity ${entity.metadata.name}:`, error);
      return entity; // Return original entity on error
    }
  }

  private createResourceEntity(parent: Entity, resource: any): Entity {
    return {
      apiVersion: 'backstage.io/v1alpha1',
      kind: 'Resource',
      metadata: {
        name: `${parent.metadata.name}-${resource.type}-${resource.name}`,
        namespace: parent.metadata.namespace,
        annotations: {
          'terraform.io/resource-address': resource.address,
          'terraform.io/resource-type': resource.type,
          'backstage.io/managed-by-location': `terraform:${parent.metadata.annotations?.['terraform.io/workspace-id']}`
        }
      },
      spec: {
        type: resource.type,
        owner: parent.spec?.owner || 'unknown',
        // Use the parent's real kind and namespace; it is not always a Component
        dependsOn: [stringifyEntityRef(parent)],
        ...resource.values
      }
    };
  }
}

7.3 Entity Provider (Real-Time Sync)

Webhook-Driven Updates:

// backend/src/providers/TerraformCloudEntityProvider.ts
import { EntityProvider, EntityProviderConnection } from '@backstage/plugin-catalog-node';
import { TerraformCloudClient } from '../service/TerraformCloudClient';

export class TerraformCloudEntityProvider implements EntityProvider {
  private connection?: EntityProviderConnection;
  private client: TerraformCloudClient;

  constructor(client: TerraformCloudClient) {
    this.client = client;
  }

  getProviderName(): string {
    return 'TerraformCloudEntityProvider';
  }

  async connect(connection: EntityProviderConnection): Promise<void> {
    this.connection = connection;

    // Start periodic full sync (every 1 hour as backup).
    // Catch failures so a rejected sync does not become an unhandled rejection.
    setInterval(() => {
      this.fullSync().catch(error => console.error('Full sync failed:', error));
    }, 3600000);

    // Initial sync on startup
    await this.fullSync();
  }

  async fullSync(): Promise<void> {
    if (!this.connection) return;

    console.log('Starting full Terraform Cloud sync...');

    // Fetch all entities from database (already processed)
    const entities = await this.fetchAllEntities();

    // Apply entities to catalog
    await this.connection.applyMutation({
      type: 'full',
      entities: entities.map(entity => ({
        entity,
        locationKey: `terraform:${entity.metadata.annotations?.['terraform.io/workspace-id']}`
      }))
    });

    console.log(`Full sync completed: ${entities.length} entities`);
  }

  async handleWebhookEvent(event: TerraformWebhookEvent): Promise<void> {
    if (!this.connection) return;

    // Fetch updated entity
    const entity = await this.client.fetchEntityForWorkspace(event.workspace_id);

    // Apply delta update to catalog. Delta mutations take the same
    // { entity, locationKey } shape as full mutations, not bare entities.
    await this.connection.applyMutation({
      type: 'delta',
      added: [{ entity, locationKey: `terraform:${event.workspace_id}` }],
      removed: []
    });

    console.log(`Webhook sync completed for workspace ${event.workspace_id}`);
  }
}

7.4 Frontend Plugin (UI Components)

Workspace Details Card:

// frontend/src/components/TerraformWorkspaceCard/TerraformWorkspaceCard.tsx
import React from 'react';
import { Entity } from '@backstage/catalog-model';
import { InfoCard, Link } from '@backstage/core-components';
import { useApi } from '@backstage/core-plugin-api';
import { terraformCloudApiRef } from '../../api';

export const TerraformWorkspaceCard = ({ entity }: { entity: Entity }) => {
  const api = useApi(terraformCloudApiRef);
  const workspaceId = entity.metadata.annotations?.['terraform.io/workspace-id'];

  const [workspace, setWorkspace] = React.useState<any>(null);
  const [loading, setLoading] = React.useState(true);

  React.useEffect(() => {
    if (!workspaceId) {
      setLoading(false); // Nothing to fetch; avoid a permanent loading state
      return undefined;
    }

    let cancelled = false; // Ignore results that arrive after unmount
    api.getWorkspace(workspaceId)
      .then(ws => { if (!cancelled) setWorkspace(ws); })
      .catch(() => { if (!cancelled) setWorkspace(null); })
      .finally(() => { if (!cancelled) setLoading(false); });

    return () => { cancelled = true; };
  }, [workspaceId, api]);

  if (loading) return <InfoCard title="Terraform Workspace">Loading...</InfoCard>;
  if (!workspace) return null;

  return (
    <InfoCard title="Terraform Workspace">
      <div>
        <strong>Organization:</strong> {workspace.organization} <br />
        <strong>Workspace:</strong> <Link to={workspace.url}>{workspace.name}</Link> <br />
        <strong>State Version:</strong> {workspace.currentStateVersion} <br />
        <strong>Last Updated:</strong> {new Date(workspace.updatedAt).toLocaleString()} <br />
        <strong>Resource Count:</strong> {workspace.resourceCount} <br />
      </div>
    </InfoCard>
  );
};

7.5 Configuration Schema

app-config.yaml Structure:

# Multi-tenant Terraform Cloud plugin configuration
terraformCloud:
  # Enable/disable plugin
  enabled: true

  # Tenant-specific configurations
  tenants:
    - id: acme-corp
      slug: acme-corp
      organization: acme-corp-prod
      auth:
        tokenSecretRef: projects/my-project/secrets/tfc-acme-corp-token

      sanitization:
        rulesConfigPath: gs://backstage-config/acme-corp/sanitization-rules.yaml
        allowlistPath: gs://backstage-config/acme-corp/allowlist.yaml

      discovery:
        github:
          enabled: true
          organizations:
            - acme-corp
          scanIntervalMinutes: 5

        terraformCloud:
          enabled: true
          webhooksEnabled: true

    - id: globex-inc
      slug: globex-inc
      organization: globex-production
      auth:
        tokenSecretRef: projects/my-project/secrets/tfc-globex-token

      sanitization:
        rulesConfigPath: gs://backstage-config/globex-inc/sanitization-rules.yaml

      discovery:
        github:
          enabled: true
          organizations:
            - globex-inc
          scanIntervalMinutes: 10

  # Global rate limiting
  rateLimiting:
    enabled: true
    requestsPerSecond: 30
    burstSize: 50

  # Webhook server configuration
  webhooks:
    enabled: true
    port: 8080
    path: /api/webhooks/terraform-cloud
    signatureHeader: X-TFE-Notification-Signature

8. Security & Compliance Architecture

8.1 Security Boundaries

8.2 Zero Trust Architecture

Principles:

  1. Never Trust, Always Verify: Every request authenticated and authorized
  2. Least Privilege Access: Minimal permissions per component
  3. Assume Breach: Defense in depth with multiple layers
  4. Encrypt Everything: Data at rest, in transit, and in use

Implementation:

import { NextFunction, Request, Response } from 'express';

// Zero Trust middleware stack (runs in order for every request)
app.use([
  // 1. Validate request integrity
  validateRequestSignature,

  // 2. Authenticate request (JWT, API key, or mTLS)
  authenticate,

  // 3. Extract tenant context
  extractTenantContext,

  // 4. Authorize action against RBAC policies
  authorize,

  // 5. Inject tenant context into database session
  injectTenantContext,

  // 6. Audit log all requests
  auditLog,

  // 7. Rate limit per tenant
  rateLimitByTenant
]);

// Example: RBAC policy
interface RBACPolicy {
  tenant_id: string;
  user_id: string;
  permissions: {
    action: 'read' | 'write' | 'admin';
    resource: 'catalog' | 'config' | 'logs';
    scope: string; // e.g., "namespace:acme-corp-dev"
  }[];
}

// Context attached to the request by authenticate and extractTenantContext
interface RequestContext {
  tenantId: string;
  userId: string;
  action: 'read' | 'write' | 'admin';
  resource: 'catalog' | 'config' | 'logs';
}

async function authorize(req: Request, res: Response, next: NextFunction) {
  const { tenantId, userId, action, resource } =
    (req as Request & { context: RequestContext }).context;

  try {
    const hasPermission = await rbac.check({
      tenant_id: tenantId,
      user_id: userId,
      action: action,
      resource: resource
    });

    if (!hasPermission) {
      return res.status(403).json({
        error: 'Insufficient permissions',
        required: `${action}:${resource}`
      });
    }

    next();
  } catch (err) {
    // Fail closed: a failed policy lookup is passed to the error handler, never treated as an allow
    next(err);
  }
}
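The `rbac.check` call above is not defined in this document. One way it could evaluate the `RBACPolicy` shape is sketched below as a pure function. The names `isAllowed` and `AccessRequest`, the `'*'` wildcard scope, and the action hierarchy (admin implies write, write implies read) are all assumptions made for illustration.

```typescript
type Action = 'read' | 'write' | 'admin';
type Resource = 'catalog' | 'config' | 'logs';

interface Permission { action: Action; resource: Resource; scope: string; }
interface RBACPolicy { tenant_id: string; user_id: string; permissions: Permission[]; }

// Hypothetical request shape; the scope field is an addition for this sketch.
interface AccessRequest {
  tenant_id: string;
  user_id: string;
  action: Action;
  resource: Resource;
  scope: string;
}

// Assumed hierarchy: a higher rank implies every lower-ranked action.
const ACTION_RANK: Record<Action, number> = { read: 1, write: 2, admin: 3 };

// Deny by default: access is granted only if some policy for this exact
// tenant and user carries a matching permission.
function isAllowed(policies: RBACPolicy[], request: AccessRequest): boolean {
  return policies
    .filter(p => p.tenant_id === request.tenant_id && p.user_id === request.user_id)
    .some(p =>
      p.permissions.some(perm =>
        perm.resource === request.resource &&
        ACTION_RANK[perm.action] >= ACTION_RANK[request.action] &&
        (perm.scope === '*' || perm.scope === request.scope)
      )
    );
}
```

Matching on `tenant_id` before anything else keeps a policy written for one tenant from ever granting access inside another, even if user IDs collide.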

8.3 Encryption Strategy

Data at Rest:

  • Database: Cloud SQL for PostgreSQL encryption at rest with customer-managed encryption keys (CMEK)
  • Backups: AES-256 encrypted via Cloud Storage CMEK
  • Secrets: Google Secret Manager with automatic rotation

Data in Transit:

  • External: TLS 1.3 with perfect forward secrecy
  • Internal: mTLS between all services (service mesh)
  • Terraform Cloud API: TLS 1.3 + API token in headers

Data in Use:

  • Memory Encryption: Confidential VMs (AMD SEV-SNP) for sensitive workloads
  • Sanitization: In-memory processing, no disk writes for unredacted state

8.4 Compliance & Audit

SOC 2 Type II Controls:

| Control | Implementation | Evidence |
|---|---|---|
| CC6.1 - Logical Access | JWT authentication + RBAC | Access logs, auth events |
| CC6.7 - Transmission Integrity | TLS 1.3 + certificate pinning | TLS handshake logs |
| CC6.7 - Transmission Confidentiality | End-to-end encryption | Encryption audit logs |
| CC7.2 - System Monitoring | Prometheus + Grafana + alerts | Metrics dashboards |
| CC8.1 - Change Management | GitOps + peer review | Git commit history |

Audit Logging:

-- Comprehensive audit table
CREATE TABLE security_audit_log (
  log_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(), -- time zone aware

  -- Request context
  tenant_id UUID NOT NULL,
  user_id VARCHAR(255),
  ip_address INET,
  user_agent TEXT,

  -- Action details
  action VARCHAR(100) NOT NULL, -- e.g., 'catalog:read', 'config:write'
  resource VARCHAR(500) NOT NULL,
  resource_id VARCHAR(255),

  -- Outcome
  status INTEGER NOT NULL, -- HTTP status code
  error_message TEXT,

  -- For forensics
  request_id UUID NOT NULL,
  session_id UUID
);

-- PostgreSQL does not accept inline INDEX clauses in CREATE TABLE, so indexes are created separately
CREATE INDEX idx_audit_tenant_time ON security_audit_log (tenant_id, timestamp DESC);
CREATE INDEX idx_audit_action ON security_audit_log (action, timestamp DESC);
CREATE INDEX idx_audit_user ON security_audit_log (user_id, timestamp DESC);

-- Example query: All failed authorization attempts in last 24h
SELECT
  timestamp,
  tenant_id,
  user_id,
  action,
  resource
FROM security_audit_log
WHERE status = 403
  AND timestamp > NOW() - INTERVAL '24 hours'
ORDER BY timestamp DESC;

9. Technology Stack Recommendations

9.1 Core Technologies

| Layer | Technology | Justification | Alternatives |
|---|---|---|---|
| Backend Runtime | Node.js 20 LTS | Backstage native, async I/O | Deno (future) |
| Backend Framework | Express.js | Backstage plugin API compatibility | Fastify (performance) |
| Frontend | React 18 | Backstage UI framework | N/A |
| Database | PostgreSQL 15 | Backstage catalog backend, RLS support | N/A |
| Cache | Redis 7 | Low-latency caching, pub/sub | Memcached |
| Message Queue | Google Cloud Pub/Sub | Managed, scalable, at-least-once delivery | RabbitMQ, Kafka |
| Container Orchestration | GKE (Kubernetes 1.28+) | Managed, auto-scaling, GCP integration | EKS, AKS |
| Service Mesh | Istio 1.20 | mTLS, traffic management, observability | Linkerd |
| Secret Management | Google Secret Manager | Managed, automatic rotation, IAM integration | HashiCorp Vault |
| Observability | Prometheus + Grafana | Industry standard, rich ecosystem | Datadog, New Relic |
| CI/CD | GitHub Actions | Native GitHub integration | GitLab CI, Cloud Build |

9.2 Terraform Cloud SDK

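API Access (direct HTTP API v2 calls; HashiCorp's maintained client library, go-tfe, targets Go):

// Call the Terraform Cloud HTTP API (v2) directly from Node.js
const TFC_API = 'https://app.terraform.io/api/v2';
const organization = 'acme-corp-prod';

const token = await secretManager.getSecret('tfc-token');
const headers = {
  Authorization: `Bearer ${token}`,
  'Content-Type': 'application/vnd.api+json'
};

async function tfcGet(url: string) {
  const res = await fetch(url, { headers });
  if (!res.ok) throw new Error(`Terraform Cloud API returned ${res.status} for ${url}`);
  return res.json();
}

// Fetch workspace
const workspace = await tfcGet(`${TFC_API}/organizations/${organization}/workspaces/my-workspace`);

// Fetch latest state version
const stateVersion = await tfcGet(`${TFC_API}/workspaces/${workspace.data.id}/current-state-version`);
const stateDownloadUrl = stateVersion.data.attributes['hosted-state-download-url'];

// Download state JSON (kept in memory only, see ADR-004)
const state = await tfcGet(stateDownloadUrl);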

9.3 Backstage Catalog Extensions

Custom Entity Kinds:

# Define Terraform-specific entity kinds
apiVersion: backstage.io/v1alpha1
kind: TerraformWorkspace
metadata:
  name: acme-corp-prod-vpc
  namespace: acme-corp
  annotations:
    terraform.io/workspace-id: ws-abc123
    terraform.io/organization: acme-corp-prod
spec:
  type: infrastructure
  lifecycle: production
  owner: platform-team

  # Terraform-specific fields
  terraform:
    version: "1.6.0"
    provider_versions:
      google: "5.0.0"
      kubernetes: "2.23.0"

    locked: false
    auto_apply: false

    vcs:
      repository: "acme-corp/infrastructure-vpc"
      branch: "main"

    variables:
      - key: project_id
        sensitive: false
      - key: api_key
        sensitive: true # Indicates sanitized
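The entity above has to be produced from a Terraform Cloud workspace record. The sketch below shows one possible mapping, assuming the workspace arrives as a JSON:API resource with the `terraform-version`, `locked`, `auto-apply`, and `vcs-repo` attributes. The function names `toEntityName` and `toWorkspaceEntity` and the naming rule (organization plus workspace name) are assumptions chosen to reproduce the example entity, not the plugin's confirmed behavior.

```typescript
// Subset of a Terraform Cloud workspace resource used by this sketch.
interface TfcWorkspace {
  id: string;
  attributes: {
    name: string;
    'terraform-version': string;
    locked: boolean;
    'auto-apply': boolean;
    'vcs-repo'?: { identifier: string; branch: string } | null;
  };
}

interface WorkspaceEntity {
  apiVersion: 'backstage.io/v1alpha1';
  kind: 'TerraformWorkspace';
  metadata: { name: string; namespace: string; annotations: Record<string, string> };
  spec: {
    type: 'infrastructure';
    owner: string;
    terraform: {
      version: string;
      locked: boolean;
      auto_apply: boolean;
      vcs?: { repository: string; branch: string };
    };
  };
}

// Normalize to a catalog-safe name: lowercase alphanumerics joined by '-', at most 63 characters.
function toEntityName(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .slice(0, 63)
    .replace(/^-+|-+$/g, '');
}

function toWorkspaceEntity(
  ws: TfcWorkspace,
  tenantSlug: string,
  organization: string,
  owner: string
): WorkspaceEntity {
  const vcs = ws.attributes['vcs-repo'];
  return {
    apiVersion: 'backstage.io/v1alpha1',
    kind: 'TerraformWorkspace',
    metadata: {
      name: toEntityName(`${organization}-${ws.attributes.name}`),
      namespace: tenantSlug, // one catalog namespace per tenant
      annotations: {
        'terraform.io/workspace-id': ws.id,
        'terraform.io/organization': organization
      }
    },
    spec: {
      type: 'infrastructure',
      owner,
      terraform: {
        version: ws.attributes['terraform-version'],
        locked: ws.attributes.locked,
        auto_apply: ws.attributes['auto-apply'],
        // Workspaces without a VCS connection simply omit the vcs block.
        ...(vcs ? { vcs: { repository: vcs.identifier, branch: vcs.branch } } : {})
      }
    }
  };
}
```

Variable values are deliberately left out of the mapping; only sanitized metadata should cross into the catalog.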

9.4 Monitoring & Alerting Stack

Metrics Collection:

# Prometheus ServiceMonitor for plugin
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backstage-terraform-plugin
spec:
  selector:
    matchLabels:
      app: backstage
      component: terraform-plugin
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s

Key Metrics:

// Custom Prometheus metrics
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const metrics = {
  // Counter: Total API requests
  apiRequests: new Counter({
    name: 'terraform_plugin_api_requests_total',
    help: 'Total API requests',
    labelNames: ['tenant_id', 'method', 'endpoint', 'status']
  }),

  // Histogram: API latency
  apiLatency: new Histogram({
    name: 'terraform_plugin_api_latency_seconds',
    help: 'API request latency',
    labelNames: ['tenant_id', 'endpoint'],
    buckets: [0.1, 0.5, 1, 2, 5, 10]
  }),

  // Gauge: Active workspaces per tenant
  activeWorkspaces: new Gauge({
    name: 'terraform_plugin_active_workspaces',
    help: 'Number of active workspaces',
    labelNames: ['tenant_id']
  }),

  // Counter: Sanitization violations
  sanitizationViolations: new Counter({
    name: 'terraform_plugin_sanitization_violations_total',
    help: 'Total sanitization violations detected',
    labelNames: ['tenant_id', 'rule_id', 'severity']
  })
};

Grafana Dashboards:

  • Tenant Overview: Workspaces, entities, API usage per tenant
  • Performance: Latency percentiles, error rates, queue depth
  • Security: Sanitization violations, auth failures, rate limit hits
  • Cost: API costs, database size, compute usage per tenant

10. Deployment Architecture

10.1 Kubernetes Deployment Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backstage-terraform-plugin
  namespace: backstage
spec:
  replicas: 3
  selector:
    matchLabels:
      app: backstage
      component: terraform-plugin
  template:
    metadata:
      labels:
        app: backstage
        component: terraform-plugin
    spec:
      serviceAccountName: backstage-terraform-sa

      # Multi-container pod
      containers:
        # Main application
        - name: plugin
          image: gcr.io/my-project/backstage-terraform-plugin:v1.0.0
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics

          env:
            - name: NODE_ENV
              value: production
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: connection-string
            - name: REDIS_URL
              value: redis://redis-service:6379

          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2000m
              memory: 4Gi

          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10

          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5

        # Cloud SQL Proxy sidecar
        - name: cloud-sql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:latest
          command:
            - "/cloud_sql_proxy"
            - "-instances=my-project:us-central1:backstage-db=tcp:5432"

          securityContext:
            runAsNonRoot: true

      # Security context
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000

10.2 Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backstage-terraform-plugin-hpa
  namespace: backstage
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backstage-terraform-plugin
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

    # Custom metric: Queue depth
    - type: External
      external:
        metric:
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              resource.labels.subscription_id: terraform-state-updates
        target:
          type: AverageValue
          averageValue: "500"

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 50 # Scale down by 50% at most
          periodSeconds: 60

    scaleUp:
      stabilizationWindowSeconds: 0 # Scale up immediately
      policies:
        - type: Percent
          value: 100 # Double capacity at most
          periodSeconds: 15
        - type: Pods
          value: 4 # Add 4 pods at most
          periodSeconds: 15
      selectPolicy: Max # Take the policy that scales fastest
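To make the queue-depth target concrete, the sketch below approximates how the settings above translate into a replica count for a given backlog. It is a simplification: it considers only the Pub/Sub metric (the real autoscaler takes the highest recommendation across all three metrics), ignores the tolerance band and the scale-down stabilization window, and treats each call as one policy period. The function name `desiredReplicasForQueue` is hypothetical.

```typescript
interface QueueScalingInput {
  undeliveredMessages: number;
  currentReplicas: number;
  targetAveragePerPod: number; // averageValue in the manifest (500)
  minReplicas: number;         // 3 in the manifest
  maxReplicas: number;         // 50 in the manifest
}

function desiredReplicasForQueue(input: QueueScalingInput): number {
  const { undeliveredMessages, currentReplicas, targetAveragePerPod, minReplicas, maxReplicas } = input;

  // AverageValue target: enough pods that each handles at most the target backlog.
  const raw = Math.ceil(undeliveredMessages / targetAveragePerPod);

  // scaleUp with selectPolicy Max: the larger of +100% or +4 pods per period.
  const upperBound = currentReplicas + Math.max(currentReplicas, 4);

  // scaleDown: remove at most 50% of the current pods per period.
  const lowerBound = Math.ceil(currentReplicas * 0.5);

  const limited = Math.min(Math.max(raw, lowerBound), upperBound);
  return Math.min(Math.max(limited, minReplicas), maxReplicas);
}
```

For example, a backlog of 6,000 messages on 3 replicas asks for 12 pods, but the scale-up policy admits only 4 more in one period, so the result is 7.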

11. Architecture Decision Records (ADRs)

ADR-001: Row-Level Security for Tenant Isolation

Status: Accepted
Date: 2024-11-13

Context: Need to isolate tenant data in a cost-effective, scalable way while maintaining strong security boundaries.

Decision: Implement PostgreSQL Row-Level Security (RLS) with tenant discriminator column.

Rationale:

  • Security: Database-enforced isolation (not application-layer)
  • Cost: Single database instance for all tenants (vs. separate DBs)
  • Performance: Indexed tenant_id column, partitioning by tenant
  • Compliance: Meets SOC 2 data isolation requirements

Consequences:

  • Positive: Lower operational overhead, simpler backups
  • Negative: RLS policy complexity, PostgreSQL version dependency (9.5+)
  • Mitigation: Extensive testing of RLS policies, monitoring for leaks
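RLS only isolates tenants if every query runs with the tenant identity set on the database session. The sketch below shows one way the `injectTenantContext` step could do that per transaction with PostgreSQL's `set_config`. The setting name `app.current_tenant`, the `Queryable` interface (a minimal stand-in for a database client), and the policy shown in the comment are assumptions for illustration.

```typescript
// Minimal client shape for this sketch; a real pooled client exposes a compatible query method.
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

// An RLS policy would then read the setting, for example (assumed policy):
//   CREATE POLICY tenant_isolation ON catalog_entities
//     USING (tenant_id = current_setting('app.current_tenant')::uuid);

async function withTenant<T>(
  db: Queryable,
  tenantId: string,
  work: (db: Queryable) => Promise<T>
): Promise<T> {
  await db.query('BEGIN');
  try {
    // Third argument true makes the setting transaction-local, so it cannot
    // leak to the next request that reuses this pooled connection.
    await db.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);
    const result = await work(db);
    await db.query('COMMIT');
    return result;
  } catch (err) {
    await db.query('ROLLBACK');
    throw err;
  }
}
```

Passing the tenant ID as a bound parameter rather than interpolating it into the SQL string keeps the context switch itself safe from injection.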

ADR-002: Pub/Sub for Asynchronous Processing

Status: Accepted
Date: 2024-11-13

Context: Need to handle burst traffic (1000+ workspace updates/minute) without blocking API requests.

Decision: Use Google Cloud Pub/Sub for message queuing between API and workers.

Rationale:

  • Scalability: Managed service, auto-scaling to millions of messages
  • Reliability: At-least-once delivery, dead letter queues
  • Integration: Native GCP integration, IAM-based auth

Consequences:

  • Positive: Decoupled architecture, elastic capacity
  • Negative: Potential message duplication (at-least-once), GCP vendor lock-in
  • Mitigation: Idempotent message handlers, multi-cloud strategy (future)
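Because delivery is at-least-once, a handler may see the same update twice or see updates out of order. One way to make the handler idempotent is sketched below, assuming each message carries the Terraform state serial, which increases with every new state. The message shape, the class name `IdempotentProcessor`, and the in-memory map are hypothetical; a shared store would be needed once several workers consume the same subscription.

```typescript
interface StateUpdateMessage {
  tenantId: string;
  workspaceId: string;
  stateSerial: number; // increases with each new state of a workspace
}

class IdempotentProcessor {
  // Highest serial already applied, per tenant and workspace.
  private readonly lastApplied = new Map<string, number>();
  private readonly apply: (msg: StateUpdateMessage) => Promise<void>;

  constructor(apply: (msg: StateUpdateMessage) => Promise<void>) {
    this.apply = apply;
  }

  async handle(msg: StateUpdateMessage): Promise<'applied' | 'skipped'> {
    const key = `${msg.tenantId}/${msg.workspaceId}`;
    const last = this.lastApplied.get(key);

    // Duplicates and stale (out-of-order) deliveries are acknowledged without side effects.
    if (last !== undefined && msg.stateSerial <= last) return 'skipped';

    await this.apply(msg);
    // Recorded only after a successful apply, so a failed attempt can be redelivered.
    this.lastApplied.set(key, msg.stateSerial);
    return 'applied';
  }
}
```

Keying on tenant plus workspace, rather than on the message ID alone, also discards an older state that arrives after a newer one has been applied.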

ADR-003: Real-Time Sync via Terraform Cloud Webhooks

Status: Accepted
Date: 2024-11-13

Context: Users expect catalog updates within 5 minutes of Terraform apply.

Decision: Subscribe to Terraform Cloud webhooks for run completion events.

Rationale:

  • Latency: Near-real-time updates (< 30 seconds) vs. polling (5-15 minutes)
  • Efficiency: Event-driven reduces unnecessary API calls (rate limits)
  • Cost: Lower Terraform Cloud API usage

Consequences:

  • Positive: Faster catalog updates, better user experience
  • Negative: Webhook reliability dependency, HMAC signature verification overhead
  • Mitigation: Fallback polling every 1 hour, webhook retry logic

ADR-004: In-Memory State Sanitization

Status: Accepted
Date: 2024-11-13

Context: Terraform state contains sensitive data (credentials, IPs) that must not reach Backstage catalog.

Decision: Sanitize state in-memory before database persistence, never write unredacted state to disk.

Rationale:

  • Security: Reduces attack surface (no plaintext state on disk)
  • Compliance: GDPR/CCPA compliance (no persistent PII)
  • Performance: In-memory processing faster than disk I/O

Consequences:

  • Positive: Stronger security posture, faster processing
  • Negative: Higher memory requirements, complex sanitization logic
  • Mitigation: Worker memory limits (4GB), sanitization rule versioning
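A minimal sketch of the in-memory pass is shown below. It walks the parsed state, redacts any value whose key matches a sensitive pattern, and redacts the `value` of any object flagged `sensitive: true`, which is how Terraform state marks sensitive outputs. The default key pattern, the `[REDACTED]` placeholder, and the function name `sanitize` are assumptions; the per-tenant rule files referenced in the configuration would replace the hard-coded pattern.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Assumed default; per-tenant rules would supply their own patterns.
// Use a pattern without the global flag, since RegExp.test is stateful with /g.
const SENSITIVE_KEY = /(password|secret|token|api[_-]?key|private[_-]?key)/i;
const REDACTED = '[REDACTED]';

// Returns a sanitized deep copy; the input is never mutated or written anywhere.
function sanitize(value: Json, keyPattern: RegExp = SENSITIVE_KEY): Json {
  if (Array.isArray(value)) {
    return value.map(item => sanitize(item, keyPattern));
  }
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      // An object carrying sensitive: true has its value field redacted.
      const flagged = key === 'value' && value['sensitive'] === true;
      out[key] = flagged || keyPattern.test(key) ? REDACTED : sanitize(child, keyPattern);
    }
    return out;
  }
  return value;
}
```

Redacting by key name alone misses secrets stored under innocuous keys, which is why the risk table pairs this kind of rule with additional detection layers and a manual review queue.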

12. Risk Analysis & Mitigation

12.1 High-Severity Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Cross-Tenant Data Leak | Low | Critical | RLS policies, extensive testing, audit logging, automated leak detection |
| Terraform Cloud Rate Limit | Medium | High | Per-tenant quotas, token bucket algorithm, caching, batch processing |
| Webhook Replay Attack | Low | Medium | HMAC signature verification, timestamp validation, nonce tracking |
| Database Outage | Low | High | High availability (Cloud SQL HA), automatic failover, connection pooling |
| PII Exposure in Catalog | Low | Critical | Multi-layer sanitization, regex + ML detection, audit trail, manual review queue |

12.2 Mitigation Strategies

Cross-Tenant Data Leak Prevention:

-- Automated leak detection query (run hourly)
SELECT
  entity_ref,
  tenant_id,
  COUNT(*) OVER (PARTITION BY entity_ref) AS duplicate_count
FROM catalog_entities
WHERE entity_ref IN (
  SELECT entity_ref
  FROM catalog_entities
  GROUP BY entity_ref
  HAVING COUNT(DISTINCT tenant_id) > 1
);
-- Alert if any results (entity visible to multiple tenants)

Rate Limit Handling:

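// Exponential backoff with jitter
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries: number = 5
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Caught values are untyped, so read the HTTP status defensively
      const status = (error as { status?: number } | null)?.status;
      if (status === 429 && attempt < maxRetries - 1) {
        const backoff = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, 8s between the 5 default attempts
        const jitter = Math.random() * 1000;
        await sleep(backoff + jitter);
      } else {
        throw error; // Not rate limited, or the last attempt failed
      }
    }
  }
  throw new Error('Max retries exceeded'); // Only reached when maxRetries <= 0
}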
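
Webhook Replay Protection:

The risk table lists timestamp validation and nonce tracking as mitigations for replayed webhooks. A minimal sketch of both follows, assuming a trustworthy send time can be taken from the signed payload and using a SHA-256 fingerprint of the raw body as the nonce. The class name `ReplayGuard`, the in-memory map, and the choice of fingerprint are assumptions; a shared cache would be needed when several replicas receive webhooks.

```typescript
import { createHash } from 'node:crypto';

class ReplayGuard {
  // Body fingerprint -> time (ms) after which the timestamp check alone rejects it.
  private readonly seen = new Map<string, number>();
  private readonly maxAgeMs: number;

  constructor(maxAgeMs: number) {
    this.maxAgeMs = maxAgeMs;
  }

  accept(rawBody: string, sentAtMs: number, nowMs: number = Date.now()): boolean {
    // Timestamp validation: reject deliveries outside the tolerance window.
    if (Math.abs(nowMs - sentAtMs) > this.maxAgeMs) return false;

    // Drop fingerprints that the timestamp check already covers.
    this.seen.forEach((expiresAtMs, key) => {
      if (expiresAtMs < nowMs) this.seen.delete(key);
    });

    // Nonce tracking: an identical body inside the window is treated as a replay.
    const fingerprint = createHash('sha256').update(rawBody).digest('hex');
    if (this.seen.has(fingerprint)) return false;

    this.seen.set(fingerprint, sentAtMs + this.maxAgeMs);
    return true;
  }
}
```

This check belongs after signature verification, since an unsigned timestamp can be set freely by an attacker. Legitimate redeliveries of the same body are also rejected, which is safe as long as the downstream handlers are idempotent.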

13. Scalability Projections

13.1 Growth Model (10 Clients → 100 Clients)

| Metric | 10 Clients | 50 Clients | 100 Clients | Notes |
|---|---|---|---|---|
| Workspaces | 2,000 | 10,000 | 20,000 | Linear growth |
| Entities | 10,000 | 50,000 | 100,000 | 5 entities/workspace |
| Daily State Syncs | 5,000 | 25,000 | 50,000 | ~2.5 syncs/workspace/day |
| Database Size | 500 MB | 2.5 GB | 5 GB | Compressed JSONB |
| API Requests/Day | 100K | 500K | 1M | 10K req/client/day |
| Queue Messages/Day | 10K | 50K | 100K | 1K msg/client/day |
| GKE Nodes | 3 | 6 | 10 | n2-standard-8 |
| Monthly Cost | $1,200 | $2,500 | $4,000 | $40/client at scale |

13.2 Breaking Points & Solutions

Database Query Performance (100K+ entities):

  • Problem: Full table scans slow down at 100K+ rows
  • Solution: Partitioning by tenant_id (10 partitions), covering indexes
  • Target: < 100ms p95 query latency

Terraform Cloud Rate Limits (30 req/sec shared):

  • Problem: 100 clients competing for 30 req/sec quota
  • Solution: Per-tenant quotas (0.3 req/sec each), intelligent caching (1-hour TTL)
  • Target: < 5% rate limit rejections
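
The per-tenant quota can be enforced with one token bucket per tenant. The sketch below assumes a refill rate of 0.3 tokens per second (30 req/sec shared across 100 tenants) and a small burst allowance. The class names, the burst size, and the explicit clock parameter are assumptions made for illustration and testability.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly ratePerSec: number;
  private readonly burst: number;

  constructor(ratePerSec: number, burst: number, nowMs: number) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
    this.tokens = burst; // start full so a new tenant can burst immediately
    this.lastRefillMs = nowMs;
  }

  tryAcquire(nowMs: number): boolean {
    // Refill in proportion to elapsed time, capped at the burst size.
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefillMs = nowMs;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

class TenantRateLimiter {
  private readonly buckets = new Map<string, TokenBucket>();
  private readonly ratePerSec: number;
  private readonly burst: number;

  constructor(ratePerSec: number, burst: number) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
  }

  // Returns true if the tenant may call Terraform Cloud now.
  tryAcquire(tenantId: string, nowMs: number = Date.now()): boolean {
    let bucket = this.buckets.get(tenantId);
    if (!bucket) {
      bucket = new TokenBucket(this.ratePerSec, this.burst, nowMs);
      this.buckets.set(tenantId, bucket);
    }
    return bucket.tryAcquire(nowMs);
  }
}
```

A fixed equal split leaves capacity unused when most tenants are idle; the buckets here are per process, so several replicas would need a shared counter to stay under the global 30 req/sec ceiling.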

Memory Pressure (sanitization workload):

  • Problem: In-memory sanitization requires 100-500MB per state
  • Solution: Worker pods with 4GB memory, queue-based distribution, streaming JSON parsing
  • Target: < 2GB memory per worker at 80% utilization

14. Implementation Roadmap

Phase 1: Foundation (Weeks 1-4)

  • Set up multi-tenant PostgreSQL database with RLS
  • Implement Terraform Cloud API client with rate limiting
  • Build basic state sanitization engine
  • Deploy backend plugin to GKE (single tenant PoC)

Phase 2: Multi-Tenant Core (Weeks 5-8)

  • Implement tenant context middleware
  • Build API key authentication system
  • Add per-tenant sanitization rules
  • Deploy Pub/Sub message queue
  • Implement catalog entity processor

Phase 3: Dynamic Discovery (Weeks 9-12)

  • Build GitHub repository scanner
  • Implement Terraform Cloud workspace enumeration
  • Add webhook event handling
  • Deploy automated onboarding workflow

Phase 4: Frontend & Polish (Weeks 13-16)

  • Build React UI components
  • Add Terraform workspace detail cards
  • Implement admin dashboard
  • Write end-to-end tests

Phase 5: Production Readiness (Weeks 17-20)

  • Load testing (10K concurrent users)
  • Security audit (SOC 2 prep)
  • Performance optimization
  • Documentation & runbooks

15. Conclusion

This architecture provides an enterprise-grade foundation for a multi-tenant Backstage plugin that integrates with Terraform Cloud at scale. Key design principles:

  1. Security First: Zero trust, encryption everywhere, tenant isolation
  2. Scalability: Horizontal scaling, queue-based architecture, database partitioning
  3. Real-Time: Webhook-driven updates, sub-5-minute latency
  4. Cost-Effective: Shared infrastructure, $40/client at 100 clients
  5. Compliance: SOC 2 Type II ready, comprehensive audit logging

Success Metrics:

  • 100+ enterprise clients supported
  • 99.9% uptime SLA
  • < 5 minute catalog sync latency
  • < 200ms API response time (p95)
  • Zero cross-tenant data leaks

Next Steps:

  1. Review with stakeholders (security, infrastructure, product)
  2. Finalize technology stack and vendor selection
  3. Begin Phase 1 implementation (foundation)
  4. Set up continuous integration and deployment pipelines

Appendix A: Glossary

  • RLS: Row-Level Security (PostgreSQL feature)
  • CMEK: Customer-Managed Encryption Keys
  • HPA: Horizontal Pod Autoscaler (Kubernetes)
  • TFC: Terraform Cloud
  • RBAC: Role-Based Access Control
  • SOC 2: Service Organization Control 2 (compliance standard)


Document Control

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-11-13 | System Architect Agent | Initial architecture design |