Enterprise Multi-Tenant Backstage Plugin Architecture
Terraform Cloud Integration for SaaS Platform
Document Version: 1.0
Date: November 13, 2024
Status: Architecture Design - Enterprise Edition
Classification: Internal Technical Design
Executive Summary
This document defines the enterprise-grade architecture for a multi-tenant Backstage plugin that integrates with Terraform Cloud to provide infrastructure catalog management across hundreds of client organizations. The design prioritizes security, scalability, tenant isolation, and real-time synchronization while maintaining a shared SaaS codebase.
Key Metrics & Targets
| Metric | Target | Scale Factor |
|---|---|---|
| Clients Supported | 100+ enterprises | 10x current |
| Workspaces per Client | 50-500 | Variable |
| Total Entities | 100,000+ | 10x current |
| API Rate Limit | 30 req/sec (Terraform Cloud) | Shared resource |
| Sync Latency | < 5 minutes (near-real-time) | Event-driven |
| Data Isolation | 100% (zero cross-tenant leaks) | Critical |
| Uptime SLA | 99.9% | Enterprise-grade |
1. System Context & Requirements
1.1 Business Context
Problem Statement:
- Multiple enterprise clients need infrastructure visibility in Backstage
- Each client has 10-100 business units with independent infrastructure repositories
- Business units are dynamically created (M&A, reorganization, new initiatives)
- Current manual catalog maintenance doesn't scale beyond 50 repositories
- Security teams require that sensitive data never appear in Backstage
Solution Vision: A multi-tenant SaaS Backstage plugin that:
- Automatically discovers infrastructure repositories across GitHub organizations
- Pulls Terraform state from Terraform Cloud workspaces
- Sanitizes sensitive data before catalog ingestion
- Maintains strict tenant isolation in shared database
- Updates catalog in near-real-time (< 5 minutes)
- Scales to 100+ clients with 1000+ repositories each
1.2 Critical Requirements
Functional Requirements
- Terraform Cloud Integration
  - Authenticate via organization tokens, team tokens, or user tokens
  - Discover all workspaces across multiple organizations
  - Download latest state versions via API
  - Subscribe to workspace run webhooks for real-time updates
  - Handle pagination (100+ workspace pages)
- Multi-Tenant Data Isolation
  - Client data NEVER visible to other clients
  - Separate authentication per client (API keys, JWT tokens)
  - Audit logging of all cross-tenant access attempts
  - Configurable tenant-specific sanitization rules
- State Sanitization
  - Detect and redact PII (emails, names, addresses)
  - Strip credentials (passwords, API keys, tokens, certificates)
  - Remove service account keys and private keys
  - Configurable allowlists per client
  - Audit trail of all redactions
- Dynamic Discovery
  - Auto-detect new business unit repositories (GitHub org scanning)
  - Extract metadata from catalog-info.yaml
  - Link repositories to Terraform Cloud workspaces
  - Detect deleted/archived repositories
- Scalability
  - Process 1000+ workspace updates concurrently
  - Queue-based architecture for burst handling
  - Database partitioning for 100+ clients
  - CDN caching for static catalog data
Non-Functional Requirements
- Security
  - SOC 2 Type II compliance
  - Encryption at rest (AES-256) and in transit (TLS 1.3)
  - Zero trust network architecture
  - Role-based access control (RBAC) per tenant
- Performance
  - < 200ms API response time (p95)
  - < 5 minute sync latency for state updates
  - Support 10,000 concurrent users
  - Database queries < 100ms (p95)
- Reliability
  - 99.9% uptime SLA
  - Automatic failover for database and queue
  - Graceful degradation under load
  - Circuit breakers for external API calls
- Observability
  - Structured logging with trace IDs
  - Metrics dashboards per tenant
  - Alerting for anomalies
  - Cost tracking per tenant
2. Terraform Cloud Integration Architecture
2.1 API Authentication Strategy
Authentication Hierarchy
| Token Type | Scope | Use Case | Rotation |
|---|---|---|---|
| Organization Token | Org-wide read/write | Initial setup, workspace creation | Manual (yearly) |
| Team Token | Team-scoped read | Read workspace states for specific teams | Automatic (90 days) |
| User Token | User-scoped read | Development/testing only | Automatic (30 days) |
Implementation Details:
// Token hierarchy with automatic fallback
interface TerraformCloudAuth {
tenantId: string;
primaryToken: string; // Organization token
fallbackTokens: string[]; // Team tokens for redundancy
rotationPolicy: {
enabled: boolean;
intervalDays: number;
alertBeforeDays: number;
};
}
// Token storage in Google Secret Manager
const secretPath = `projects/${PROJECT_ID}/secrets/tfc-token-${tenantId}/versions/latest`;
2.2 Workspace Discovery API
Flow:
- Organization Listing: GET /api/v2/organizations
- Workspace Pagination: GET /api/v2/organizations/{org}/workspaces?page[size]=100&page[number]=1
- State Version Retrieval: GET /api/v2/workspaces/{workspace_id}/current-state-version
- State Download: GET {state_version.hosted_state_download_url}
Rate Limiting Strategy:
- Terraform Cloud: 30 requests/second (shared across all users)
- Plugin: Implements token bucket algorithm
- Per-tenant quota: 5 requests/second
- Burst allowance: 50 requests
// Rate limiter implementation
class TerraformCloudRateLimiter {
  private buckets: Map<string, TokenBucket> = new Map();
  async checkLimit(tenantId: string): Promise<boolean> {
    let bucket = this.buckets.get(tenantId);
    if (!bucket) {
      // Create the tenant's bucket on first use and keep a reference to it
      bucket = new TokenBucket(5, 5); // 5 req/sec
      this.buckets.set(tenantId, bucket);
    }
    return bucket.tryConsume(1);
  }
  async waitForCapacity(tenantId: string): Promise<void> {
    while (!(await this.checkLimit(tenantId))) {
      await sleep(200); // Wait 200ms before retry
    }
  }
}
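The limiter above depends on a `TokenBucket` class and a `sleep` helper that this document does not define. The following is a minimal sketch of both, assuming the constructor takes a burst capacity and a refill rate in tokens per second; the constructor signature and the injectable clock are assumptions made for illustration, not part of the design.

```typescript
// Hypothetical TokenBucket: `capacity` is the maximum burst,
// `refillPerSecond` is the sustained request rate.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private readonly capacity: number,
    private readonly refillPerSecond: number,
    private readonly now: () => number = () => Date.now(), // injectable clock (assumption)
  ) {
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = now();
  }

  // Add tokens for the time elapsed since the last call, capped at capacity.
  private refill(): void {
    const nowMs = this.now();
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefillMs = nowMs;
  }

  // Take `count` tokens if available; otherwise consume nothing and return false.
  tryConsume(count: number): boolean {
    this.refill();
    if (this.tokens < count) return false;
    this.tokens -= count;
    return true;
  }
}

// Promise-based delay of the kind waitForCapacity needs.
const sleep = (ms: number): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, ms));
```

Under this signature, the per-tenant quota of 5 requests/second with a burst allowance of 50 would be expressed as `new TokenBucket(50, 5)`.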
2.3 Webhook Event Processing
Real-Time Updates via Terraform Cloud Webhooks:
# Webhook Configuration
webhooks:
- name: "backstage-catalog-sync"
enabled: true
url: "https://plugin.backstage.example.com/api/webhooks/terraform-cloud"
token: "<webhook-secret-token>"
events:
- "run:completed"
- "run:errored"
- "workspace:created"
- "workspace:deleted"
Event Processing Flow:
Webhook Security:
- HMAC-SHA512 signature verification (Terraform Cloud sends the hex digest in the X-TFE-Notification-Signature header)
- IP allowlist for Terraform Cloud endpoints
- Replay attack prevention (timestamp + nonce)
- Rate limiting per tenant (1000 events/hour)
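Signature verification from the list above can be sketched with Node's built-in `crypto` module. The function name and parameters are hypothetical. Terraform Cloud's notification documentation describes the `X-TFE-Notification-Signature` value as an HMAC-SHA512 hex digest of the payload, so that is the default here; the algorithm is left as a parameter in case the deployment differs. The HMAC must be computed over the raw request bytes, before any JSON parsing.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

type HmacAlgorithm = 'sha256' | 'sha512';

// Returns true when `signatureHex` is the HMAC of `rawBody` under `secret`.
function verifyWebhookSignature(
  rawBody: string | Buffer,
  signatureHex: string,
  secret: string,
  algorithm: HmacAlgorithm = 'sha512',
): boolean {
  const expected = createHmac(algorithm, secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  // Constant-time comparison avoids leaking how many bytes matched.
  return timingSafeEqual(received, expected);
}
```

A handler would typically reject the request with a 401 when this returns false, before enqueueing anything.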
3. Multi-Tenant Data Architecture
3.1 Tenant Isolation Strategy
Option Analysis:
| Strategy | Pros | Cons | Recommendation |
|---|---|---|---|
| Separate Databases | Maximum isolation, simple RBAC | High cost, complex backups | ❌ Not scalable |
| Separate Schemas | Good isolation, moderate cost | Schema migrations complex | ⚠️ Fallback option |
| Row-Level Security (RLS) | Cost-effective, single DB | Complex policies, performance overhead | ✅ Primary choice |
| Discriminator Column | Simple to implement | No DB-level isolation, risky | ❌ Insufficient security |
Selected Approach: PostgreSQL Row-Level Security (RLS) + Tenant Column
-- Tenant isolation table
CREATE TABLE tenants (
tenant_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
org_slug VARCHAR(100) UNIQUE NOT NULL,
plan_tier VARCHAR(50) DEFAULT 'enterprise',
status VARCHAR(20) NOT NULL DEFAULT 'active', -- checked when tenant context is validated
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- Catalog entities with tenant column
CREATE TABLE catalog_entities (
entity_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
entity_ref VARCHAR(500) NOT NULL,
entity_kind VARCHAR(100) NOT NULL,
entity_name VARCHAR(255) NOT NULL,
metadata JSONB NOT NULL,
spec JSONB NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
CONSTRAINT unique_entity_per_tenant UNIQUE (tenant_id, entity_ref)
);
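-- Enable Row-Level Security (FORCE applies the policies to the table owner as well)
ALTER TABLE catalog_entities ENABLE ROW LEVEL SECURITY;
ALTER TABLE catalog_entities FORCE ROW LEVEL SECURITY;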
-- Policy: Users can only see their tenant's data
CREATE POLICY tenant_isolation_policy ON catalog_entities
USING (tenant_id = current_setting('app.current_tenant')::UUID);
-- Index for performance
CREATE INDEX idx_entities_tenant_id ON catalog_entities(tenant_id);
CREATE INDEX idx_entities_kind_name ON catalog_entities(tenant_id, entity_kind, entity_name);
Tenant Context Injection:
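// Middleware to set tenant context
app.use(async (req, res, next) => {
  try {
    const tenantId = await extractTenantId(req); // From JWT, API key, or header
    if (!tenantId) {
      return res.status(401).json({ error: 'Missing tenant context' });
    }
    // Validate tenant exists and is active (assumes a node-postgres style result with `rows`)
    const result = await db.query(
      'SELECT tenant_id FROM tenants WHERE tenant_id = $1 AND status = $2',
      [tenantId, 'active']);
    if (result.rows.length === 0) {
      return res.status(403).json({ error: 'Invalid or inactive tenant' });
    }
    // Set PostgreSQL session variable for RLS. SET does not accept bind parameters,
    // so set_config() is used; is_local = true scopes it to the current transaction,
    // which must be the same transaction (and connection) that runs the tenant's queries.
    await db.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);
    req.tenantId = tenantId;
    next();
  } catch (err) {
    next(err); // Let the error handler respond instead of leaving the request hanging
  }
});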
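A transaction-local setting lasts only until the end of the current transaction, so with a connection pool the tenant variable has to be set on the same connection, and inside the same transaction, as the queries it protects. The helper below is a minimal sketch of that pattern. The `TxPool` and `TxClient` interfaces are assumptions shaped like a typical PostgreSQL driver pool, and `withTenant` is a hypothetical name.

```typescript
// Minimal structural types so the sketch does not depend on a specific driver.
interface TxClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
  release(): void;
}
interface TxPool {
  connect(): Promise<TxClient>;
}

// Runs `work` inside one transaction with app.current_tenant set for RLS.
async function withTenant<T>(
  pool: TxPool,
  tenantId: string,
  work: (client: TxClient) => Promise<T>,
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // set_config(name, value, is_local = true) is the parameterizable form of SET LOCAL.
    await client.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release(); // always return the connection to the pool
  }
}
```

Request handlers would then run every tenant-scoped query through the `client` passed to `work`, so the RLS policy always sees the intended tenant.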
3.2 Tenant Identification Methods
Priority Order:
- JWT Token (Production): Tenant ID embedded in signed JWT claims
- API Key (Service-to-Service): Hashed API key mapped to tenant
- Header Override (Development): X-Tenant-ID header (disabled in production)
// JWT Token Structure
interface BackstageTenantJWT {
sub: string; // User ID
tenant_id: string; // Primary tenant identifier
tenant_slug: string; // Human-readable tenant name
permissions: string[]; // Backstage permissions
iat: number; // Issued at
exp: number; // Expiration (1 hour)
}
// API Key Structure (hashed in database)
interface APIKey {
key_id: string;
tenant_id: string;
key_hash: string; // bcrypt hash of API key
scopes: string[]; // e.g., ['read:catalog', 'write:webhooks']
expires_at: Date;
last_used_at: Date;
}
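The priority order above can be sketched as a small pure function. It assumes the JWT and API key have already been verified upstream and only decides which source wins; `TenantCredentials` and `resolveTenant` are hypothetical names introduced for illustration.

```typescript
interface TenantCredentials {
  jwtTenantId?: string;     // tenant_id claim from an already-verified JWT
  apiKeyTenantId?: string;  // tenant mapped from an already-validated API key
  headerTenantId?: string;  // raw X-Tenant-ID header value
}

type TenantSource = 'jwt' | 'api-key' | 'header';

// Applies the priority order: JWT, then API key, then the development-only header.
function resolveTenant(
  creds: TenantCredentials,
  isProduction: boolean,
): { tenantId: string; source: TenantSource } | null {
  if (creds.jwtTenantId) {
    return { tenantId: creds.jwtTenantId, source: 'jwt' };
  }
  if (creds.apiKeyTenantId) {
    return { tenantId: creds.apiKeyTenantId, source: 'api-key' };
  }
  // The header override is honored only outside production.
  if (!isProduction && creds.headerTenantId) {
    return { tenantId: creds.headerTenantId, source: 'header' };
  }
  return null;
}
```

Returning the source alongside the tenant ID lets audit logs record how each request was attributed.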
3.3 Entity Naming Conventions
Cross-Tenant Uniqueness:
# Standard entity ref format
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
# Client-prefixed names prevent collisions
name: acme-corp-payment-service
namespace: acme-corp # Tenant slug as namespace
annotations:
backstage.io/techdocs-ref: dir:.
backstage.io/source-location: url:https://github.com/acme-corp/payment-service
terraform.io/workspace-id: ws-abc123def456
terraform.io/organization: acme-corp-prod
spec:
type: service
lifecycle: production
owner: acme-corp-platform-team
system: acme-corp-payments
Namespace Hierarchy:
{tenant-slug} # Root namespace (e.g., acme-corp)
├── {tenant-slug}-infrastructure # Infrastructure components
├── {tenant-slug}-platform # Platform services
└── {tenant-slug}-{business-unit} # Business unit namespaces
├── {bu}-development
├── {bu}-staging
└── {bu}-production
4. State Sanitization Pipeline Architecture
4.1 Sanitization Workflow
4.2 Sensitive Data Detection Rules
Rule Categories:
| Category | Detection Method | Example Patterns | Action |
|---|---|---|---|
| PII | Regex + ML | Email, SSN, phone numbers | Redact |
| Credentials | Keyword + entropy | API keys, passwords, tokens | Redact |
| Private Keys | Format detection | RSA/EC keys, certificates | Redact |
| Cloud Secrets | Provider-specific | GCP service account keys, AWS access keys | Redact |
| Database Credentials | Connection strings | postgres://user:pass@host | Redact |
| IP Addresses | Regex (private ranges) | 10.x.x.x, 192.168.x.x (configurable) | Redact or Allow |
Detection Engine:
interface SanitizationRule {
id: string;
name: string;
category: 'pii' | 'credential' | 'infrastructure';
enabled: boolean;
pattern: RegExp | string;
action: 'redact' | 'hash' | 'allow';
priority: number; // Higher priority rules run first
}
class StateSanitizer {
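private rules: SanitizationRule[];
private allowlists: Map<string, Set<string>>; // Per-tenant allowlists
// tenantId and auditLog are kept as fields because sanitize() uses both
// (AuditLog is an assumed type exposing record())
constructor(private readonly tenantId: string, private readonly auditLog: AuditLog) {
this.rules = this.loadRules(tenantId);
this.allowlists = this.loadAllowlists(tenantId);
}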
async sanitize(state: TerraformState): Promise<SanitizedState> {
const violations: Violation[] = [];
const redactedState = cloneDeep(state);
// Traverse JSON recursively
this.traverseAndSanitize(redactedState.resources, violations);
// Log all redactions for audit
await this.auditLog.record({
tenantId: this.tenantId,
stateId: state.id,
violations: violations,
timestamp: new Date()
});
return {
sanitized: redactedState,
violations: violations,
safe: violations.length === 0
};
}
private traverseAndSanitize(obj: any, violations: Violation[], path: string = ''): void {
for (const [key, value] of Object.entries(obj)) {
const currentPath = path ? `${path}.${key}` : key; // Avoid a leading dot at the root
// Check if key itself is sensitive (e.g., "password", "api_key")
if (this.isSensitiveKey(key)) {
violations.push({ path: currentPath, rule: 'sensitive-key', value: '***' });
obj[key] = '[REDACTED]';
continue;
}
// Check if value matches sensitive patterns
if (typeof value === 'string') {
const match = this.matchesRule(value);
if (match && !this.isAllowlisted(currentPath, value)) {
violations.push({ path: currentPath, rule: match.id, value: '***' });
obj[key] = `[REDACTED:${match.category}]`;
}
} else if (value !== null && typeof value === 'object') { // typeof null is 'object'
this.traverseAndSanitize(value, violations, currentPath);
}
}
}
}
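The class above calls `isSensitiveKey` and `matchesRule` without showing them. A minimal sketch of both follows, written as standalone functions. The list of sensitive key names is an assumption extrapolated from the `"password"` and `"api_key"` examples in the comment, and the simplified `Rule` type only accepts `RegExp` patterns.

```typescript
interface Rule {
  id: string;
  category: string;
  enabled: boolean;
  pattern: RegExp; // must not carry the /g flag, because test() would then be stateful
  priority: number;
}

// Hypothetical key-name list; real deployments would load this per tenant.
const SENSITIVE_KEY = /(password|passwd|secret|token|api[_-]?key|private[_-]?key)/i;

function isSensitiveKey(key: string): boolean {
  return SENSITIVE_KEY.test(key);
}

// Returns the highest-priority enabled rule whose pattern matches, if any.
function matchesRule(value: string, rules: Rule[]): Rule | undefined {
  return rules
    .filter(rule => rule.enabled)              // filter() copies, so sort() below is safe
    .sort((a, b) => b.priority - a.priority)   // higher priority rules run first
    .find(rule => rule.pattern.test(value));
}
```

Sorting on every call keeps the sketch short; a real implementation would likely sort once when the rules are loaded.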
4.3 Configurable Sanitization Rules
Per-Tenant Configuration:
# Example: acme-corp sanitization config
tenant_id: acme-corp
sanitization:
global_rules:
- id: strip-passwords
enabled: true
pattern: "(password|passwd|pwd)\\s*[:=]\\s*['\"]?([^'\"\\s]+)"
action: redact
- id: strip-api-keys
enabled: true
pattern: "[a-zA-Z0-9]{32,}" # High-entropy strings
min_entropy: 4.5
action: redact
allowlists:
# Allow specific IP ranges
- pattern: "10\\.128\\..*"
reason: "Internal VPC CIDR"
# Allow specific service account emails
- pattern: "terraform@acme-corp\\.iam\\.gserviceaccount\\.com"
reason: "Public service account for Terraform Cloud"
custom_rules:
- id: redact-internal-domains
pattern: ".*\\.internal\\.acme\\.com"
action: redact
reason: "Internal domain names are confidential"
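The `min_entropy: 4.5` setting on the `strip-api-keys` rule implies an entropy check on top of the regex match. One common way to do this is Shannon entropy in bits per character, sketched below. Whether the plugin uses this exact measure is an assumption, and the function names are hypothetical.

```typescript
// Shannon entropy of a string, in bits per character.
function shannonEntropy(s: string): number {
  if (s.length === 0) return 0;
  const counts: Record<string, number> = {};
  for (let i = 0; i < s.length; i++) {
    const ch = s.charAt(i);
    counts[ch] = (counts[ch] ?? 0) + 1;
  }
  let entropy = 0;
  for (const ch of Object.keys(counts)) {
    const p = counts[ch] / s.length; // relative frequency of this character
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Mirrors the strip-api-keys rule: a run of 32+ alphanumerics that is also high-entropy.
function looksLikeApiKey(value: string, minEntropy = 4.5): boolean {
  const match = value.match(/[a-zA-Z0-9]{32,}/);
  return match !== null && shannonEntropy(match[0]) >= minEntropy;
}
```

The entropy threshold is what keeps long but repetitive identifiers, such as padded resource names, from being redacted by the length-based pattern alone.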
4.4 Audit Logging
Audit Log Schema:
CREATE TABLE sanitization_audit_log (
log_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
workspace_id VARCHAR(255) NOT NULL,
state_version VARCHAR(100) NOT NULL,
violations JSONB NOT NULL, -- Array of violation objects
sanitized_at TIMESTAMP DEFAULT NOW(),
-- For compliance reporting
pii_count INTEGER DEFAULT 0,
credential_count INTEGER DEFAULT 0,
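total_redactions INTEGER DEFAULT 0
);
-- PostgreSQL does not support inline INDEX clauses; create indexes separately
CREATE INDEX idx_audit_tenant_time ON sanitization_audit_log (tenant_id, sanitized_at);
CREATE INDEX idx_audit_workspace ON sanitization_audit_log (workspace_id);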
-- Example query: Violations by tenant over last 30 days
SELECT
tenant_id,
COUNT(*) as total_sanitizations,
SUM(pii_count) as total_pii_redactions,
SUM(credential_count) as total_credential_redactions
FROM sanitization_audit_log
WHERE sanitized_at > NOW() - INTERVAL '30 days'
GROUP BY tenant_id;
5. Scalability Architecture
5.1 System Components
5.2 Horizontal Scaling Strategy
Auto-Scaling Policies:
| Component | Metric | Scale Up Threshold | Scale Down Threshold | Min/Max Replicas |
|---|---|---|---|---|
| Backend API | CPU Utilization | > 70% | < 30% | 3 / 50 |
| Catalog Processor | Queue Depth | > 1000 messages | < 100 messages | 5 / 100 |
| Webhook Handler | Request Rate | > 500 req/sec | < 100 req/sec | 2 / 20 |
Kubernetes HorizontalPodAutoscaler:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: catalog-processor-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: catalog-processor
minReplicas: 5
maxReplicas: 100
metrics:
- type: External
external:
metric:
name: pubsub.googleapis.com|subscription|num_undelivered_messages
selector:
matchLabels:
resource.labels.subscription_id: terraform-state-updates
target:
type: AverageValue
averageValue: "1000"
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
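To make the effect of the two metrics concrete, the sketch below approximates how the autoscaler combines them: each metric proposes a replica count, the largest proposal wins, and the result is clamped to the configured bounds. This is a simplification that ignores the controller's tolerance band and stabilization windows; the function and field names are hypothetical.

```typescript
interface HpaInputs {
  currentReplicas: number;
  undeliveredMessages: number;    // total backlog on the subscription
  cpuUtilizationPercent: number;  // average across current pods
}

// Values taken from the manifest above.
const HPA = { minReplicas: 5, maxReplicas: 100, targetAvgMessages: 1000, targetCpuPercent: 70 };

function desiredReplicas(input: HpaInputs): number {
  // AverageValue target: total metric value divided by the per-pod target.
  const byQueue = Math.ceil(input.undeliveredMessages / HPA.targetAvgMessages);
  // Utilization target: scale the current replica count by the usage ratio.
  const byCpu = Math.ceil(
    input.currentReplicas * (input.cpuUtilizationPercent / HPA.targetCpuPercent),
  );
  // With several metrics, the largest proposal is used, then clamped to the bounds.
  const proposed = Math.max(byQueue, byCpu);
  return Math.min(HPA.maxReplicas, Math.max(HPA.minReplicas, proposed));
}
```

For example, a backlog of 12,000 messages at 35% CPU on 5 replicas proposes 12 from the queue and 3 from CPU, so the deployment scales to 12.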
5.3 Queue Architecture
Message Queue Design (Google Cloud Pub/Sub):
topics:
- name: terraform-workspace-discovered
description: "New workspace discovered via GitHub scan or TFC webhook"
message_retention: 7 days
subscriptions:
- name: catalog-processor-subscription
ack_deadline: 600s # 10 minutes for state processing
retry_policy:
minimum_backoff: 10s
maximum_backoff: 600s
- name: terraform-state-updated
description: "State version updated in Terraform Cloud"
message_retention: 7 days
subscriptions:
- name: state-sync-subscription
ack_deadline: 300s
retry_policy:
minimum_backoff: 5s
maximum_backoff: 300s
- name: sanitization-failed
description: "Dead letter queue for sanitization failures"
message_retention: 30 days
subscriptions:
- name: manual-review-subscription
ack_deadline: 600s # Pub/Sub maximum; manual review should pull on demand rather than hold messages for an hour
Message Schema:
interface WorkspaceDiscoveredMessage {
tenant_id: string;
workspace_id: string;
workspace_name: string;
organization: string;
repository_url: string;
discovered_at: string;
discovery_method: 'github-scan' | 'terraform-webhook' | 'manual';
}
interface StateUpdatedMessage {
tenant_id: string;
workspace_id: string;
state_version: string;
run_id: string;
updated_at: string;
priority: 'high' | 'normal' | 'low';
}
5.4 Database Optimization
Partitioning Strategy:
-- Partition catalog_entities by tenant_id (hash partitioning)
CREATE TABLE catalog_entities (
entity_id UUID DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL,
entity_ref VARCHAR(500) NOT NULL,
metadata JSONB NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
...
) PARTITION BY HASH (tenant_id);
-- Create 10 partitions (adjust based on tenant count)
CREATE TABLE catalog_entities_p0 PARTITION OF catalog_entities
FOR VALUES WITH (MODULUS 10, REMAINDER 0);
CREATE TABLE catalog_entities_p1 PARTITION OF catalog_entities
FOR VALUES WITH (MODULUS 10, REMAINDER 1);
-- ... create p2 through p9
-- Index declared on the parent; PostgreSQL 11+ creates it on every partition automatically
CREATE INDEX idx_entities_ref ON catalog_entities(entity_ref);
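The elided partition statements (p2 through p9) follow a fixed pattern, so they can be generated rather than written by hand. A minimal sketch, with a hypothetical `partitionDdl` helper; it only builds the SQL strings and leaves execution to whatever migration tooling is in use.

```typescript
// Builds one CREATE TABLE ... PARTITION OF statement per hash remainder.
function partitionDdl(table: string, modulus: number): string[] {
  const statements: string[] = [];
  for (let remainder = 0; remainder < modulus; remainder++) {
    statements.push(
      `CREATE TABLE ${table}_p${remainder} PARTITION OF ${table}\n` +
      `  FOR VALUES WITH (MODULUS ${modulus}, REMAINDER ${remainder});`,
    );
  }
  return statements;
}

// Example: the ten partitions described above.
const ddl = partitionDdl('catalog_entities', 10);
console.log(ddl.join('\n'));
```

Changing the modulus later requires redistributing rows, so the partition count is best chosen with projected tenant growth in mind.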
Caching Strategy (Redis):
// Cache layers
const CACHE_TTL = {
CATALOG_ENTITY: 300, // 5 minutes
WORKSPACE_METADATA: 600, // 10 minutes
TENANT_CONFIG: 3600, // 1 hour
STATE_CHECKSUM: 86400, // 24 hours (for change detection)
};
// Cache key patterns
const cacheKeys = {
entity: (tenantId: string, entityRef: string) =>
`entity:${tenantId}:${entityRef}`,
workspaceList: (tenantId: string) =>
`workspaces:${tenantId}:list`,
tenantConfig: (tenantId: string) =>
`tenant:${tenantId}:config`,
};
// Cache-aside pattern
async function getCatalogEntity(tenantId: string, entityRef: string): Promise<Entity | null> {
  const cacheKey = cacheKeys.entity(tenantId, entityRef);
  // Try cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached) as Entity;
  }
  // Cache miss: fetch from database (assumes a node-postgres style result with `rows`)
  const result = await db.query(
    'SELECT * FROM catalog_entities WHERE tenant_id = $1 AND entity_ref = $2',
    [tenantId, entityRef]
  );
  const entity: Entity | undefined = result.rows[0];
  if (!entity) {
    return null; // Not found; do not cache the miss
  }
  // Store in cache
  await redis.setex(cacheKey, CACHE_TTL.CATALOG_ENTITY, JSON.stringify(entity));
  return entity;
}
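The `STATE_CHECKSUM` entry is described as being for change detection. One way to use it is to hash each downloaded state and skip processing when the hash matches the previous one, as sketched below. The in-memory `ChecksumStore` stands in for Redis, and the key format and function names are assumptions.

```typescript
import { createHash } from 'node:crypto';

// Stand-in for a Redis-backed store; a Map satisfies this interface.
interface ChecksumStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

function stateChecksum(rawState: string): string {
  return createHash('sha256').update(rawState).digest('hex');
}

// Returns true (and records the new checksum) only when the state differs
// from the last one seen for this tenant and workspace.
function hasStateChanged(
  store: ChecksumStore,
  tenantId: string,
  workspaceId: string,
  rawState: string,
): boolean {
  const key = `state-checksum:${tenantId}:${workspaceId}`; // hypothetical key format
  const next = stateChecksum(rawState);
  if (store.get(key) === next) return false;
  store.set(key, next);
  return true;
}
```

Skipping unchanged states avoids re-running sanitization and catalog writes when a run completes without modifying any resources.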
5.5 Performance Projections
Capacity Planning (100 Clients):
| Resource | Per Client | 100 Clients | Notes |
|---|---|---|---|
| Workspaces | 200 | 20,000 | Average across clients |
| Entities | 1,000 | 100,000 | Components, systems, APIs |
| State Syncs/Day | 500 | 50,000 | ~0.6 syncs/second average |
| API Requests/Day | 10,000 | 1,000,000 | ~12 req/second average |
| Database Size | 50 MB | 5 GB | JSONB compressed |
| Queue Messages/Day | 1,000 | 100,000 | Burst capacity: 1000 msg/sec |
Cost Estimates (GCP us-central1):
| Service | Specification | Monthly Cost |
|---|---|---|
| GKE Cluster | 10x n2-standard-8 nodes | $2,500 |
| Cloud SQL (PostgreSQL) | db-custom-8-32GB (HA) | $800 |
| Cloud Pub/Sub | 100M messages/month | $400 |
| Cloud Storage | 100 GB (state backups) | $2 |
| Secret Manager | 1000 secrets x 10k accesses | $30 |
| Load Balancer | 10 TB ingress/egress | $300 |
| Total | | ~$4,000/month |
Cost per Client: ~$40/month (at 100 clients)
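The averages and totals in the two tables above follow from simple arithmetic, reproduced here as a quick check. All input figures come from the tables; only the variable names are new.

```typescript
const SECONDS_PER_DAY = 86400;
const clients = 100;

// Daily volumes per client, from the capacity table.
const stateSyncsPerDay = 500 * clients;    // 50,000
const apiRequestsPerDay = 10000 * clients; // 1,000,000

const syncsPerSecond = stateSyncsPerDay / SECONDS_PER_DAY;     // about 0.58
const requestsPerSecond = apiRequestsPerDay / SECONDS_PER_DAY; // about 11.6

// Monthly line items in USD, from the cost table.
const monthlyCosts: Record<string, number> = {
  gke: 2500,
  cloudSql: 800,
  pubSub: 400,
  cloudStorage: 2,
  secretManager: 30,
  loadBalancer: 300,
};
const totalMonthly = Object.keys(monthlyCosts)
  .reduce((sum, key) => sum + monthlyCosts[key], 0); // 4032
const costPerClient = totalMonthly / clients;         // 40.32

console.log({ syncsPerSecond, requestsPerSecond, totalMonthly, costPerClient });
```

The line items sum to $4,032, which the table rounds to ~$4,000/month and ~$40 per client.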
6. Dynamic Discovery & Onboarding
6.1 GitHub Repository Scanning
Discovery Flow:
Implementation:
// GitHub scanner service
class GitHubRepoScanner {
private octokit: Octokit;
private tenantId: string;
async scanOrganization(orgName: string): Promise<DiscoveryResult[]> {
const repos = await this.octokit.paginate(
'GET /orgs/{org}/repos',
{ org: orgName, per_page: 100 }
);
const results: DiscoveryResult[] = [];
for (const repo of repos) {
try {
// Fetch catalog-info.yaml from default branch
const catalogInfo = await this.fetchCatalogInfo(repo);
if (catalogInfo) {
const workspaceId = this.extractWorkspaceId(catalogInfo);
results.push({
repository: repo.full_name,
workspace_id: workspaceId,
entity_ref: catalogInfo.metadata.name,
discovered_at: new Date()
});
// Enqueue for processing
await this.enqueueDiscovery(catalogInfo, workspaceId);
}
} catch (error) {
console.error(`Failed to scan ${repo.full_name}:`, error);
}
}
return results;
}
private async fetchCatalogInfo(repo: Repository): Promise<CatalogInfo | null> {
  try {
    const response = await this.octokit.rest.repos.getContent({
      owner: repo.owner.login,
      repo: repo.name,
      path: 'catalog-info.yaml',
      ref: repo.default_branch
    });
    if ('content' in response.data) {
      const content = Buffer.from(response.data.content, 'base64').toString('utf8');
      return yaml.parse(content) as CatalogInfo;
    }
    return null; // Path exists but is not a regular file (e.g., a directory)
  } catch (error) {
    if ((error as { status?: number }).status === 404) {
      return null; // No catalog-info.yaml
    }
    throw error;
  }
}
private extractWorkspaceId(catalogInfo: CatalogInfo): string | null {
  // Optional chaining yields undefined, so normalize to null
  return catalogInfo.metadata.annotations?.['terraform.io/workspace-id'] ?? null;
}
}
6.2 Terraform Cloud Workspace Enumeration
Bulk Workspace Discovery:
// Terraform Cloud workspace scanner
class TerraformCloudScanner {
private client: TerraformCloudClient;
private rateLimiter: TerraformCloudRateLimiter;
private tenantId: string;
async enumerateWorkspaces(orgName: string): Promise<Workspace[]> {
  const workspaces: Workspace[] = [];
  let page = 1;
  let hasMore = true;
  while (hasMore) {
    // Rate limit protection: wait for capacity before each request
    await this.rateLimiter.waitForCapacity(this.tenantId);
    const response = await this.client.get(
      `/organizations/${orgName}/workspaces`,
      {
        params: {
          'page[size]': 100,
          'page[number]': page
        }
      }
    );
    workspaces.push(...response.data);
    // Terraform Cloud returns hyphenated pagination keys (next-page is null on the last page)
    hasMore = response.meta.pagination['next-page'] !== null;
    page++;
  }
  return workspaces;
}
async linkToRepository(workspace: Workspace): Promise<string | null> {
// Extract repository from workspace VCS settings
if (workspace.attributes['vcs-repo']) {
const vcsRepo = workspace.attributes['vcs-repo'];
return `https://github.com/${vcsRepo.identifier}`;
}
return null;
}
}
6.3 Automated Onboarding Workflow
New Business Unit Setup (< 5 minutes):
- Repository Creation (Manual): DevOps team creates bu-{name}-infra repo in GitHub
- catalog-info.yaml Template: CI/CD adds template during repo initialization
- Terraform Cloud Workspace: Created automatically via TFC API
- First Scan (Automated): Scheduled scanner picks up new repo within 5 minutes
- State Sync (Automated): Catalog processor fetches state and creates entity
- Backstage Visibility: Entity appears in catalog within 30 seconds
Onboarding Template (catalog-info.yaml):
apiVersion: backstage.io/v1alpha1
kind: System
metadata:
name: {{ tenant_slug }}-{{ business_unit }}-infrastructure
namespace: {{ tenant_slug }}
description: Infrastructure for {{ business_unit }} business unit
annotations:
# Automatically linked by discovery
terraform.io/workspace-id: "{{ workspace_id }}"
terraform.io/organization: "{{ tenant_slug }}-{{ environment }}"
github.com/project-slug: "{{ org }}/{{ repo }}"
backstage.io/techdocs-ref: dir:.
spec:
owner: {{ tenant_slug }}-{{ business_unit }}-team
domain: infrastructure
type: infrastructure
7. Plugin Architecture Components
7.1 Backstage Plugin Structure
backstage-plugin-terraform-cloud/
├── backend/ # Backend plugin
│ ├── src/
│ │ ├── service/
│ │ │ ├── router.ts # API routes
│ │ │ ├── TerraformCloudClient.ts
│ │ │ ├── StateSanitizer.ts
│ │ │ └── CatalogSync.ts
│ │ ├── processors/
│ │ │ └── TerraformCloudEntityProcessor.ts
│ │ ├── providers/
│ │ │ └── TerraformCloudEntityProvider.ts
│ │ └── plugin.ts
│ └── package.json
│
├── frontend/ # Frontend plugin
│ ├── src/
│ │ ├── components/
│ │ │ ├── TerraformWorkspaceCard/
│ │ │ ├── StateResourcesTable/
│ │ │ └── WorkspaceRunsTimeline/
│ │ ├── routes.ts
│ │ └── plugin.ts
│ └── package.json
│
├── common/ # Shared types
│ ├── src/
│ │ ├── types.ts
│ │ └── api.ts
│ └── package.json
│
└── docs/
├── setup.md
└── configuration.md
7.2 Backend Plugin (Catalog Processor)
Core Responsibilities:
- Fetch Terraform Cloud state
- Sanitize sensitive data
- Transform state to Backstage entities
- Handle webhook events
Implementation:
// backend/src/processors/TerraformCloudEntityProcessor.ts
import { CatalogProcessor, CatalogProcessorEmit } from '@backstage/plugin-catalog-node';
import { Entity } from '@backstage/catalog-model';
import { TerraformCloudClient } from '../service/TerraformCloudClient';
import { StateSanitizer } from '../service/StateSanitizer';
export class TerraformCloudEntityProcessor implements CatalogProcessor {
private client: TerraformCloudClient;
private sanitizer: StateSanitizer;
constructor(
client: TerraformCloudClient,
sanitizer: StateSanitizer
) {
this.client = client;
this.sanitizer = sanitizer;
}
getProcessorName(): string {
return 'TerraformCloudEntityProcessor';
}
async postProcessEntity(
entity: Entity,
_location: any,
emit: CatalogProcessorEmit
): Promise<Entity> {
// Check if entity has Terraform Cloud annotations
const workspaceId = entity.metadata.annotations?.['terraform.io/workspace-id'];
if (!workspaceId) {
return entity; // Not a Terraform-managed entity
}
try {
// Fetch latest state from Terraform Cloud
const state = await this.client.fetchWorkspaceState(workspaceId);
// Sanitize state; sanitize() returns { sanitized, violations, safe }
const result = await this.sanitizer.sanitize(state);
const resources = result.sanitized.resources;
// Extract resources and create child entities
for (const resource of resources) {
  const childEntity = this.createResourceEntity(entity, resource);
  // Entity results must carry the location they were emitted from
  emit({ type: 'entity', entity: childEntity, location: _location });
}
// Enrich parent entity with state metadata (annotation values must be strings)
entity.metadata.annotations = {
  ...entity.metadata.annotations,
  'terraform.io/state-version': String(state.version),
  'terraform.io/last-updated': state.updated_at,
  'terraform.io/resource-count': resources.length.toString()
};
return entity;
} catch (error) {
console.error(`Failed to process Terraform entity ${entity.metadata.name}:`, error);
return entity; // Return original entity on error
}
}
private createResourceEntity(parent: Entity, resource: any): Entity {
return {
apiVersion: 'backstage.io/v1alpha1',
kind: 'Resource',
metadata: {
name: `${parent.metadata.name}-${resource.type}-${resource.name}`,
namespace: parent.metadata.namespace,
annotations: {
'terraform.io/resource-address': resource.address,
'terraform.io/resource-type': resource.type,
'backstage.io/managed-by-location': `terraform:${parent.metadata.annotations?.['terraform.io/workspace-id']}`
}
},
spec: {
type: resource.type,
owner: parent.spec?.owner || 'unknown',
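// Full entity ref of the parent, which is not necessarily a Component
dependsOn: [`${parent.kind.toLowerCase()}:${parent.metadata.namespace ?? 'default'}/${parent.metadata.name}`],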
...resource.values
}
};
}
}
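Child entity names built from `resource.type` and `resource.name` can violate Backstage's name rules, which allow 1 to 63 characters made of alphanumeric runs separated by single `-`, `_`, or `.` characters. Indexed resources such as `web[0]` or long module paths are typical offenders. The helper below is a minimal sketch of one way to normalize such names; `toEntityName` and its hash-suffix truncation scheme are assumptions, not part of the plugin design.

```typescript
import { createHash } from 'node:crypto';

const MAX_NAME_LENGTH = 63;

// Joins the parts and coerces the result into a valid Backstage entity name.
function toEntityName(...parts: string[]): string {
  const raw = parts.join('-');
  let name = raw
    .replace(/[^a-zA-Z0-9\-_.]+/g, '-') // replace disallowed characters
    .replace(/[-_.]{2,}/g, '-')         // collapse runs of separators
    .replace(/^[-_.]+|[-_.]+$/g, '');   // must start and end with an alphanumeric
  if (name.length > MAX_NAME_LENGTH) {
    // Keep truncated names unique by appending a short hash of the original.
    const suffix = createHash('sha256').update(raw).digest('hex').slice(0, 8);
    name = name.slice(0, MAX_NAME_LENGTH - 9).replace(/[-_.]+$/, '') + '-' + suffix;
  }
  return name;
}
```

The sketch does not handle input that contains no alphanumeric characters at all, which would produce an empty name and should be rejected by the caller.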
7.3 Entity Provider (Real-Time Sync)
Webhook-Driven Updates:
// backend/src/providers/TerraformCloudEntityProvider.ts
import { EntityProvider, EntityProviderConnection } from '@backstage/plugin-catalog-node';
import { TerraformCloudClient } from '../service/TerraformCloudClient';
export class TerraformCloudEntityProvider implements EntityProvider {
private connection?: EntityProviderConnection;
private client: TerraformCloudClient;
constructor(client: TerraformCloudClient) {
this.client = client;
}
getProviderName(): string {
return 'TerraformCloudEntityProvider';
}
async connect(connection: EntityProviderConnection): Promise<void> {
this.connection = connection;
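// Start periodic full sync (every 1 hour as backup); catch errors so a failed
// sync does not surface as an unhandled promise rejection
setInterval(() => {
  this.fullSync().catch(err => console.error('Periodic full sync failed:', err));
}, 3600000);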
// Initial sync on startup
await this.fullSync();
}
async fullSync(): Promise<void> {
if (!this.connection) return;
console.log('Starting full Terraform Cloud sync...');
// Fetch all entities from database (already processed)
const entities = await this.fetchAllEntities();
// Apply entities to catalog
await this.connection.applyMutation({
type: 'full',
entities: entities.map(entity => ({
entity,
locationKey: `terraform:${entity.metadata.annotations?.['terraform.io/workspace-id']}`
}))
});
console.log(`Full sync completed: ${entities.length} entities`);
}
async handleWebhookEvent(event: TerraformWebhookEvent): Promise<void> {
if (!this.connection) return;
// Fetch updated entity
const entity = await this.client.fetchEntityForWorkspace(event.workspace_id);
// Apply delta update to catalog
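await this.connection.applyMutation({
  type: 'delta',
  // Delta mutations take deferred entities ({ entity, locationKey }), matching fullSync
  added: [{ entity, locationKey: `terraform:${event.workspace_id}` }],
  removed: []
});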
console.log(`Webhook sync completed for workspace ${event.workspace_id}`);
}
}
7.4 Frontend Plugin (UI Components)
Workspace Details Card:
// frontend/src/components/TerraformWorkspaceCard/TerraformWorkspaceCard.tsx
import React from 'react';
import { Entity } from '@backstage/catalog-model';
import { InfoCard, Link } from '@backstage/core-components';
import { useApi } from '@backstage/core-plugin-api';
import { terraformCloudApiRef } from '../../api';
export const TerraformWorkspaceCard = ({ entity }: { entity: Entity }) => {
const api = useApi(terraformCloudApiRef);
const workspaceId = entity.metadata.annotations?.['terraform.io/workspace-id'];
const [workspace, setWorkspace] = React.useState<any>(null);
const [loading, setLoading] = React.useState(true);
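React.useEffect(() => {
  if (!workspaceId) {
    setLoading(false); // Nothing to load without the annotation
    return;
  }
  api.getWorkspace(workspaceId)
    .then(setWorkspace)
    .catch(() => setWorkspace(null)) // Render nothing if the lookup fails
    .finally(() => setLoading(false));
}, [workspaceId, api]);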
if (loading) return <InfoCard title="Terraform Workspace">Loading...</InfoCard>;
if (!workspace) return null;
return (
<InfoCard title="Terraform Workspace">
<div>
<strong>Organization:</strong> {workspace.organization} <br />
<strong>Workspace:</strong> <Link to={workspace.url}>{workspace.name}</Link> <br />
<strong>State Version:</strong> {workspace.currentStateVersion} <br />
<strong>Last Updated:</strong> {new Date(workspace.updatedAt).toLocaleString()} <br />
<strong>Resource Count:</strong> {workspace.resourceCount} <br />
</div>
</InfoCard>
);
};
7.5 Configuration Schema
app-config.yaml Structure:
# Multi-tenant Terraform Cloud plugin configuration
terraformCloud:
# Enable/disable plugin
enabled: true
# Tenant-specific configurations
tenants:
- id: acme-corp
slug: acme-corp
organization: acme-corp-prod
auth:
tokenSecretRef: projects/my-project/secrets/tfc-acme-corp-token
sanitization:
rulesConfigPath: gs://backstage-config/acme-corp/sanitization-rules.yaml
allowlistPath: gs://backstage-config/acme-corp/allowlist.yaml
discovery:
github:
enabled: true
organizations:
- acme-corp
scanIntervalMinutes: 5
terraformCloud:
enabled: true
webhooksEnabled: true
- id: globex-inc
slug: globex-inc
organization: globex-production
auth:
tokenSecretRef: projects/my-project/secrets/tfc-globex-token
sanitization:
rulesConfigPath: gs://backstage-config/globex-inc/sanitization-rules.yaml
discovery:
github:
enabled: true
organizations:
- globex-inc
scanIntervalMinutes: 10
# Global rate limiting
rateLimiting:
enabled: true
requestsPerSecond: 30
burstSize: 50
# Webhook server configuration
webhooks:
enabled: true
port: 8080
path: /api/webhooks/terraform-cloud
signatureHeader: X-TFE-Notification-Signature
8. Security & Compliance Architecture
8.1 Security Boundaries
8.2 Zero Trust Architecture
Principles:
- Never Trust, Always Verify: Every request authenticated and authorized
- Least Privilege Access: Minimal permissions per component
- Assume Breach: Defense in depth with multiple layers
- Encrypt Everything: Data at rest, in transit, and in use
Implementation:
// Zero Trust middleware stack
app.use([
// 1. Validate request integrity
validateRequestSignature,
// 2. Authenticate request (JWT, API key, or mTLS)
authenticate,
// 3. Extract tenant context
extractTenantContext,
// 4. Rate limit per tenant before any further work is done
rateLimitByTenant,
// 5. Audit log all requests (registered before authorization so denied requests are recorded too)
auditLog,
// 6. Authorize action against RBAC policies
authorize,
// 7. Inject tenant context into database session
injectTenantContext
]);
// Example: RBAC policy
interface RBACPolicy {
tenant_id: string;
user_id: string;
permissions: {
action: 'read' | 'write' | 'admin';
resource: 'catalog' | 'config' | 'logs';
scope: string; // e.g., "namespace:acme-corp-dev"
}[];
}
async function authorize(req: Request, res: Response, next: NextFunction) {
const { tenantId, userId, action, resource } = req.context;
const hasPermission = await rbac.check({
tenant_id: tenantId,
user_id: userId,
action: action,
resource: resource
});
if (!hasPermission) {
return res.status(403).json({
error: 'Insufficient permissions',
required: `${action}:${resource}`
});
}
next();
}
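The `rbac.check` call above is not defined in this document. As a rough illustration of how policies shaped like `RBACPolicy` could be evaluated, here is an in-memory sketch. The function `checkAccess`, the `AccessRequest` type, and the rule that `admin` implies `write` and `write` implies `read` are assumptions made for the example. The `scope` field is ignored here for brevity.

```typescript
type Action = 'read' | 'write' | 'admin';
type Resource = 'catalog' | 'config' | 'logs';

interface Permission {
  action: Action;
  resource: Resource;
  scope: string;
}

interface Policy {
  tenant_id: string;
  user_id: string;
  permissions: Permission[];
}

interface AccessRequest {
  tenant_id: string;
  user_id: string;
  action: Action;
  resource: Resource;
}

// Assumed ordering: a higher rank implies every lower-ranked action.
const ACTION_RANK: Record<Action, number> = { read: 0, write: 1, admin: 2 };

/** Returns true when some policy for this tenant and user grants the request. */
function checkAccess(policies: Policy[], req: AccessRequest): boolean {
  return policies
    .filter(p => p.tenant_id === req.tenant_id && p.user_id === req.user_id)
    .some(p =>
      p.permissions.some(
        perm =>
          perm.resource === req.resource &&
          ACTION_RANK[perm.action] >= ACTION_RANK[req.action],
      ),
    );
}
```

Matching on both `tenant_id` and `user_id` means a policy granted in one tenant never satisfies a request made in another, which is the property the middleware relies on.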
8.3 Encryption Strategy
Data at Rest:
- Database: Cloud SQL for PostgreSQL encryption at rest with customer-managed encryption keys (CMEK)
- Backups: AES-256 encrypted via Cloud Storage CMEK
- Secrets: Google Secret Manager with automatic rotation
Data in Transit:
- External: TLS 1.3 with perfect forward secrecy
- Internal: mTLS between all services (service mesh)
- Terraform Cloud API: TLS 1.3 + API token in headers
Data in Use:
- Memory Encryption: Confidential VMs (AMD SEV-SNP) for sensitive workloads
- Sanitization: In-memory processing, no disk writes for unredacted state
8.4 Compliance & Audit
SOC 2 Type II Controls:
| Control | Implementation | Evidence |
|---|---|---|
| CC6.1 - Logical Access | JWT authentication + RBAC | Access logs, auth events |
| CC6.7 - Transmission Integrity | TLS 1.3 + certificate pinning | TLS handshake logs |
| CC6.7 - Transmission Confidentiality | End-to-end encryption | Encryption audit logs |
| CC7.2 - System Monitoring | Prometheus + Grafana + alerts | Metrics dashboards |
| CC8.1 - Change Management | GitOps + peer review | Git commit history |
Audit Logging:
-- Comprehensive audit table
CREATE TABLE security_audit_log (
  log_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- Request context
  tenant_id UUID NOT NULL,
  user_id VARCHAR(255),
  ip_address INET,
  user_agent TEXT,
  -- Action details
  action VARCHAR(100) NOT NULL, -- e.g., 'catalog:read', 'config:write'
  resource VARCHAR(500) NOT NULL,
  resource_id VARCHAR(255),
  -- Outcome
  status INTEGER NOT NULL, -- HTTP status code
  error_message TEXT,
  -- For forensics
  request_id UUID NOT NULL,
  session_id UUID
);

-- PostgreSQL does not accept inline INDEX clauses in CREATE TABLE,
-- so the indexes are created as separate statements.
CREATE INDEX idx_audit_tenant_time ON security_audit_log (tenant_id, timestamp DESC);
CREATE INDEX idx_audit_action ON security_audit_log (action, timestamp DESC);
CREATE INDEX idx_audit_user ON security_audit_log (user_id, timestamp DESC);

-- Example query: All failed authorization attempts in last 24h
SELECT
  timestamp,
  tenant_id,
  user_id,
  action,
  resource
FROM security_audit_log
WHERE status = 403
  AND timestamp > NOW() - INTERVAL '24 hours'
ORDER BY timestamp DESC;
9. Technology Stack Recommendations
9.1 Core Technologies
| Layer | Technology | Justification | Alternatives |
|---|---|---|---|
| Backend Runtime | Node.js 20 LTS | Backstage native, async I/O | Deno (future) |
| Backend Framework | Express.js | Backstage plugin API compatibility | Fastify (performance) |
| Frontend | React 18 | Backstage UI framework | N/A |
| Database | PostgreSQL 15 | Backstage catalog backend, RLS support | N/A |
| Cache | Redis 7 | Low-latency caching, pub/sub | Memcached |
| Message Queue | Google Cloud Pub/Sub | Managed, scalable, at-least-once delivery | RabbitMQ, Kafka |
| Container Orchestration | GKE (Kubernetes 1.28+) | Managed, auto-scaling, GCP integration | EKS, AKS |
| Service Mesh | Istio 1.20 | mTLS, traffic management, observability | Linkerd |
| Secret Management | Google Secret Manager | Managed, automatic rotation, IAM integration | HashiCorp Vault |
| Observability | Prometheus + Grafana | Industry standard, rich ecosystem | Datadog, New Relic |
| CI/CD | GitHub Actions | Native GitHub integration | GitLab CI, Cloud Build |
9.2 Terraform Cloud SDK
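API Client:
// HashiCorp publishes no official Node.js SDK for Terraform Cloud (the official
// client, go-tfe, targets Go), so this example calls the HTTP API directly.
const TFC_API = 'https://app.terraform.io/api/v2';
const token = await secretManager.getSecret('tfc-token');
const headers = {
  Authorization: `Bearer ${token}`,
  'Content-Type': 'application/vnd.api+json'
};
// Fetch workspace by organization and name
const workspace = await fetch(
  `${TFC_API}/organizations/acme-corp-prod/workspaces/my-workspace`,
  { headers }
).then(r => r.json());
// Fetch latest state version for the workspace
const stateVersion = await fetch(
  `${TFC_API}/workspaces/${workspace.data.id}/current-state-version`,
  { headers }
).then(r => r.json());
const stateDownloadUrl = stateVersion.data.attributes['hosted-state-download-url'];
// Download state JSON (send the bearer token with the download request as well)
const state = await fetch(stateDownloadUrl, {
  headers: { Authorization: `Bearer ${token}` }
}).then(r => r.json());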
9.3 Backstage Catalog Extensions
Custom Entity Kinds:
# Define Terraform-specific entity kinds
apiVersion: backstage.io/v1alpha1
kind: TerraformWorkspace
metadata:
  name: acme-corp-prod-vpc
  namespace: acme-corp
  annotations:
    terraform.io/workspace-id: ws-abc123
    terraform.io/organization: acme-corp-prod
spec:
  type: infrastructure
  lifecycle: production
  owner: platform-team
  # Terraform-specific fields
  terraform:
    version: "1.6.0"
    provider_versions:
      google: "5.0.0"
      kubernetes: "2.23.0"
    locked: false
    auto_apply: false
    vcs:
      repository: "acme-corp/infrastructure-vpc"
      branch: "main"
    variables:
      - key: project_id
        sensitive: false
      - key: api_key
        sensitive: true # Indicates sanitized
9.4 Monitoring & Alerting Stack
Metrics Collection:
# Prometheus ServiceMonitor for plugin
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backstage-terraform-plugin
spec:
  selector:
    matchLabels:
      app: backstage
      component: terraform-plugin
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
Key Metrics:
// Custom Prometheus metrics
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const metrics = {
  // Counter: Total API requests
  apiRequests: new Counter({
    name: 'terraform_plugin_api_requests_total',
    help: 'Total API requests',
    labelNames: ['tenant_id', 'method', 'endpoint', 'status']
  }),
  // Histogram: API latency
  apiLatency: new Histogram({
    name: 'terraform_plugin_api_latency_seconds',
    help: 'API request latency',
    labelNames: ['tenant_id', 'endpoint'],
    buckets: [0.1, 0.5, 1, 2, 5, 10]
  }),
  // Gauge: Active workspaces per tenant
  activeWorkspaces: new Gauge({
    name: 'terraform_plugin_active_workspaces',
    help: 'Number of active workspaces',
    labelNames: ['tenant_id']
  }),
  // Counter: Sanitization violations
  sanitizationViolations: new Counter({
    name: 'terraform_plugin_sanitization_violations_total',
    help: 'Total sanitization violations detected',
    labelNames: ['tenant_id', 'rule_id', 'severity']
  })
};
Grafana Dashboards:
- Tenant Overview: Workspaces, entities, API usage per tenant
- Performance: Latency percentiles, error rates, queue depth
- Security: Sanitization violations, auth failures, rate limit hits
- Cost: API costs, database size, compute usage per tenant
10. Deployment Architecture
10.1 Kubernetes Deployment Manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backstage-terraform-plugin
  namespace: backstage
spec:
  replicas: 3
  selector:
    matchLabels:
      app: backstage
      component: terraform-plugin
  template:
    metadata:
      labels:
        app: backstage
        component: terraform-plugin
    spec:
      serviceAccountName: backstage-terraform-sa
      # Multi-container pod
      containers:
        # Main application
        - name: plugin
          image: gcr.io/my-project/backstage-terraform-plugin:v1.0.0
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics
          env:
            - name: NODE_ENV
              value: production
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: connection-string
            - name: REDIS_URL
              value: redis://redis-service:6379
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2000m
              memory: 4Gi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
        # Cloud SQL Proxy sidecar
        - name: cloud-sql-proxy
          # Pin a specific release tag in production; `latest` makes rollouts non-reproducible
          image: gcr.io/cloudsql-docker/gce-proxy:latest
          command:
            - "/cloud_sql_proxy"
            - "-instances=my-project:us-central1:backstage-db=tcp:5432"
          securityContext:
            runAsNonRoot: true
      # Security context
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
10.2 Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backstage-terraform-plugin-hpa
  namespace: backstage
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backstage-terraform-plugin
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric: Queue depth
    - type: External
      external:
        metric:
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              resource.labels.subscription_id: terraform-state-updates
        target:
          type: AverageValue
          averageValue: "500"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 50 # Scale down by 50% at most
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0 # Scale up immediately
      policies:
        - type: Percent
          value: 100 # Double capacity at most
          periodSeconds: 15
        - type: Pods
          value: 4 # Add 4 pods at most
          periodSeconds: 15
      selectPolicy: Max # Take the policy that scales fastest
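The scaleUp policies above set an upper bound on how fast the deployment can grow: in each 15-second period the HPA may double the replica count or add 4 pods, whichever is larger, up to `maxReplicas`. The sketch below works through that ceiling. It is a simplification that ignores the metric-driven desired replica count and assumes each period starts from the previous period's result. The names `ScaleUpLimits` and `scaleUpCeiling` are illustrative.

```typescript
interface ScaleUpLimits {
  percent: number;      // Percent policy value (100 = may double)
  pods: number;         // Pods policy value (absolute pods added)
  maxReplicas: number;  // HPA maxReplicas cap
}

/** Largest replica count allowed after one policy period (selectPolicy: Max). */
function nextCeiling(current: number, limits: ScaleUpLimits): number {
  const byPercent = Math.ceil(current * (1 + limits.percent / 100));
  const byPods = current + limits.pods;
  return Math.min(limits.maxReplicas, Math.max(byPercent, byPods));
}

/** Replica ceilings over consecutive periods, starting from `start`. */
function scaleUpCeiling(start: number, limits: ScaleUpLimits, periods: number): number[] {
  const trajectory = [start];
  for (let i = 0; i < periods; i++) {
    trajectory.push(nextCeiling(trajectory[trajectory.length - 1], limits));
  }
  return trajectory;
}

// With the manifest's values: 3 -> 7 -> 14 -> 28 -> 50
console.log(scaleUpCeiling(3, { percent: 100, pods: 4, maxReplicas: 50 }, 4));
```

Under these assumptions the deployment can go from its 3-replica floor to the 50-replica cap in four periods, roughly one minute. The Pods policy only matters at small replica counts, where adding 4 pods exceeds doubling.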
11. Architecture Decision Records (ADRs)
ADR-001: Row-Level Security for Tenant Isolation
Status: Accepted Date: 2024-11-13
Context: Need to isolate tenant data in a cost-effective, scalable way while maintaining strong security boundaries.
Decision: Implement PostgreSQL Row-Level Security (RLS) with tenant discriminator column.
Rationale:
- Security: Database-enforced isolation (not application-layer)
- Cost: Single database instance for all tenants (vs. separate DBs)
- Performance: Indexed tenant_id column, partitioning by tenant
- Compliance: Meets SOC 2 data isolation requirements
Consequences:
- Positive: Lower operational overhead, simpler backups
- Negative: RLS policy complexity, PostgreSQL version dependency (9.5+)
- Mitigation: Extensive testing of RLS policies, monitoring for leaks
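One common way to feed the tenant discriminator to RLS policies is a transaction-local setting that the policy reads with `current_setting(...)`. The sketch below assumes a policy of roughly the form `USING (tenant_id = current_setting('app.current_tenant_id')::uuid)`. The setting name `app.current_tenant_id`, the `Queryable` interface, and the `withTenant` helper are all hypothetical. `Queryable` is shaped so that a PostgreSQL client exposing `query(text, values)` could be passed in.

```typescript
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

/**
 * Runs `work` inside a transaction whose RLS tenant context is `tenantId`.
 * set_config(name, value, is_local = true) scopes the setting to the current
 * transaction, so it cannot leak to the next user of a pooled connection.
 */
async function withTenant<T>(
  db: Queryable,
  tenantId: string,
  work: (db: Queryable) => Promise<T>,
): Promise<T> {
  await db.query('BEGIN');
  try {
    await db.query("SELECT set_config('app.current_tenant_id', $1, true)", [tenantId]);
    const result = await work(db);
    await db.query('COMMIT');
    return result;
  } catch (err) {
    await db.query('ROLLBACK');
    throw err;
  }
}
```

Passing the tenant ID as a bound parameter instead of interpolating it into the SQL string keeps the context-setting step itself free of injection risk.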
ADR-002: Pub/Sub for Asynchronous Processing
Status: Accepted Date: 2024-11-13
Context: Need to handle burst traffic (1000+ workspace updates/minute) without blocking API requests.
Decision: Use Google Cloud Pub/Sub for message queuing between API and workers.
Rationale:
- Scalability: Managed service, auto-scaling to millions of messages
- Reliability: At-least-once delivery, dead letter queues
- Integration: Native GCP integration, IAM-based auth
Consequences:
- Positive: Decoupled architecture, elastic capacity
- Negative: Potential message duplication (at-least-once), GCP vendor lock-in
- Mitigation: Idempotent message handlers, multi-cloud strategy (future)
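The idempotent-handler mitigation can be illustrated with a small deduplicating wrapper. This is a sketch under stated assumptions: the message shape `StateUpdateMessage` and the class `IdempotentHandler` are hypothetical, and the in-memory `Set` stands in for a store shared across worker replicas (for example Redis or a unique database constraint), which a multi-replica deployment would need.

```typescript
interface StateUpdateMessage {
  messageId: string;   // delivery-stable ID used as the deduplication key
  workspaceId: string;
}

class IdempotentHandler {
  private readonly seen = new Set<string>();
  private readonly processFn: (msg: StateUpdateMessage) => Promise<void>;

  constructor(processFn: (msg: StateUpdateMessage) => Promise<void>) {
    this.processFn = processFn;
  }

  /** Returns true if the message was processed, false if it was a duplicate. */
  async handle(msg: StateUpdateMessage): Promise<boolean> {
    if (this.seen.has(msg.messageId)) return false;
    await this.processFn(msg);
    // Record only after success, so a failed attempt can be redelivered and retried.
    this.seen.add(msg.messageId);
    return true;
  }
}
```

Because the ID is recorded after processing, two concurrent deliveries of the same message could both run. The processing step should therefore also be safe to repeat, for instance by upserting catalog entities keyed on workspace ID.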
ADR-003: Real-Time Sync via Terraform Cloud Webhooks
Status: Accepted Date: 2024-11-13
Context: Users expect catalog updates within 5 minutes of Terraform apply.
Decision: Subscribe to Terraform Cloud webhooks for run completion events.
Rationale:
- Latency: Near-real-time updates (< 30 seconds) vs. polling (5-15 minutes)
- Efficiency: Event-driven reduces unnecessary API calls (rate limits)
- Cost: Lower Terraform Cloud API usage
Consequences:
- Positive: Faster catalog updates, better user experience
- Negative: Webhook reliability dependency, HMAC signature verification overhead
- Mitigation: Fallback polling every 1 hour, webhook retry logic
ADR-004: In-Memory State Sanitization
Status: Accepted Date: 2024-11-13
Context: Terraform state contains sensitive data (credentials, IPs) that must not reach Backstage catalog.
Decision: Sanitize state in-memory before database persistence, never write unredacted state to disk.
Rationale:
- Security: Reduces attack surface (no plaintext state on disk)
- Compliance: Supports GDPR/CCPA data minimization (no unredacted PII is persisted)
- Performance: In-memory processing faster than disk I/O
Consequences:
- Positive: Stronger security posture, faster processing
- Negative: Higher memory requirements, complex sanitization logic
- Mitigation: Worker memory limits (4GB), sanitization rule versioning
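As an illustration of the in-memory approach, the sketch below walks a parsed state object and returns a redacted copy without touching disk. The key pattern, the `[REDACTED]` placeholder, and the `sanitize` function are assumptions made for the example. The per-tenant rule files referenced in the configuration would replace the hard-coded pattern.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

const REDACTED = '[REDACTED]';
// Hypothetical rule: redact any value whose key name looks sensitive.
const SENSITIVE_KEY = /(password|secret|token|api_key|private_key)/i;

/** Returns a deep copy of `value` with sensitive-keyed values replaced. */
function sanitize(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map(item => sanitize(item));
  }
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] = SENSITIVE_KEY.test(key) ? REDACTED : sanitize(child);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

The function builds a new object and leaves the input untouched, so the unredacted state can be dropped as soon as the sanitized copy is handed to the persistence layer. Key-name matching alone misses secrets stored under innocuous keys, which is why the design pairs it with additional detection layers.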
12. Risk Analysis & Mitigation
12.1 High-Severity Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Cross-Tenant Data Leak | Low | Critical | RLS policies, extensive testing, audit logging, automated leak detection |
| Terraform Cloud Rate Limit | Medium | High | Per-tenant quotas, token bucket algorithm, caching, batch processing |
| Webhook Replay Attack | Low | Medium | HMAC signature verification, timestamp validation, nonce tracking |
| Database Outage | Low | High | High availability (Cloud SQL HA), automatic failover, connection pooling |
| PII Exposure in Catalog | Low | Critical | Multi-layer sanitization, regex + ML detection, audit trail, manual review queue |
12.2 Mitigation Strategies
Cross-Tenant Data Leak Prevention:
-- Automated leak detection query (run hourly)
-- Run as a role that bypasses RLS; a tenant-scoped session sees only its own rows
SELECT
  entity_ref,
  tenant_id,
  COUNT(*) OVER (PARTITION BY entity_ref) as duplicate_count
FROM catalog_entities
WHERE entity_ref IN (
  SELECT entity_ref
  FROM catalog_entities
  GROUP BY entity_ref
  HAVING COUNT(DISTINCT tenant_id) > 1
);
-- Alert if any results (entity visible to multiple tenants)
Rate Limit Handling:
// Exponential backoff with jitter
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries: number = 5
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // `error` is `unknown` under strict TypeScript, so narrow before reading `status`
      const status = (error as { status?: number } | null)?.status;
      if (status === 429 && attempt < maxRetries - 1) {
        const backoff = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, 8s before the final attempt
        const jitter = Math.random() * 1000;
        await sleep(backoff + jitter);
      } else {
        throw error;
      }
    }
  }
  throw new Error('Max retries exceeded');
}
13. Scalability Projections
13.1 Growth Model (10 Clients → 100 Clients)
| Metric | 10 Clients | 50 Clients | 100 Clients | Notes |
|---|---|---|---|---|
| Workspaces | 2,000 | 10,000 | 20,000 | Linear growth |
| Entities | 10,000 | 50,000 | 100,000 | 5 entities/workspace |
| Daily State Syncs | 5,000 | 25,000 | 50,000 | ~2.5 syncs/workspace/day |
| Database Size | 500 MB | 2.5 GB | 5 GB | Compressed JSONB |
| API Requests/Day | 100K | 500K | 1M | 10K req/client/day |
| Queue Messages/Day | 10K | 50K | 100K | 1K msg/client/day |
| GKE Nodes | 3 | 6 | 10 | n2-standard-8 |
| Monthly Cost | $1,200 | $2,500 | $4,000 | $40/client at scale |
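The linear rows of the table follow from a few per-client ratios, which can be checked with a short calculation. The ratios below are derived from the table's own figures: 2,000 workspaces for 10 clients gives 200 workspaces per client, and 5,000 daily syncs across 2,000 workspaces gives 2.5 syncs per workspace per day. The function `project` and its constant names are illustrative.

```typescript
interface Projection {
  workspaces: number;
  entities: number;
  dailySyncs: number;
  apiRequestsPerDay: number;
  queueMessagesPerDay: number;
}

// Ratios implied by the growth table (assumed to hold linearly).
const WORKSPACES_PER_CLIENT = 200;
const ENTITIES_PER_WORKSPACE = 5;
const SYNCS_PER_WORKSPACE_PER_DAY = 2.5;
const API_REQUESTS_PER_CLIENT_PER_DAY = 10_000;
const QUEUE_MESSAGES_PER_CLIENT_PER_DAY = 1_000;

/** Projects the table's linear metrics for a given number of clients. */
function project(clients: number): Projection {
  const workspaces = clients * WORKSPACES_PER_CLIENT;
  return {
    workspaces,
    entities: workspaces * ENTITIES_PER_WORKSPACE,
    dailySyncs: workspaces * SYNCS_PER_WORKSPACE_PER_DAY,
    apiRequestsPerDay: clients * API_REQUESTS_PER_CLIENT_PER_DAY,
    queueMessagesPerDay: clients * QUEUE_MESSAGES_PER_CLIENT_PER_DAY,
  };
}

console.log(project(100));
```

Database size, node count, and monthly cost are left out because the table does not scale them strictly linearly with client count.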
13.2 Breaking Points & Solutions
Database Query Performance (100K+ entities):
- Problem: Full table scans slow down at 100K+ rows
- Solution: Partitioning by tenant_id (10 partitions), covering indexes
- Target: < 100ms p95 query latency
Terraform Cloud Rate Limits (30 req/sec shared):
- Problem: 100 clients competing for 30 req/sec quota
- Solution: Per-tenant quotas (0.3 req/sec each), intelligent caching (1-hour TTL)
- Target: < 5% rate limit rejections
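The per-tenant quota can be enforced with a token bucket per tenant. The sketch below is a minimal single-process version. The class names `TokenBucket` and `TenantRateLimiter` are hypothetical, and the injectable clock exists only to make the behavior testable. A deployment with several replicas would need the bucket state in a shared store such as Redis.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly ratePerSec: number;
  private readonly capacity: number;
  private readonly now: () => number;

  constructor(ratePerSec: number, capacity: number, now: () => number = Date.now) {
    this.ratePerSec = ratePerSec;
    this.capacity = capacity;
    this.now = now;
    this.tokens = capacity; // start full so a quiet tenant can burst
    this.lastRefillMs = now();
  }

  /** Takes one token if available; returns false when the caller must wait. */
  tryAcquire(): boolean {
    const nowMs = this.now();
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

class TenantRateLimiter {
  private readonly buckets = new Map<string, TokenBucket>();
  private readonly ratePerSec: number;
  private readonly capacity: number;
  private readonly now: () => number;

  constructor(ratePerSec: number, capacity: number, now: () => number = Date.now) {
    this.ratePerSec = ratePerSec;
    this.capacity = capacity;
    this.now = now;
  }

  /** Checks the bucket for `tenantId`, creating it on first use. */
  tryAcquire(tenantId: string): boolean {
    let bucket = this.buckets.get(tenantId);
    if (!bucket) {
      bucket = new TokenBucket(this.ratePerSec, this.capacity, this.now);
      this.buckets.set(tenantId, bucket);
    }
    return bucket.tryAcquire();
  }
}
```

With `new TenantRateLimiter(0.3, 1)`, each tenant gets the 0.3 req/sec share described above, and one tenant exhausting its bucket has no effect on the others. A larger capacity would allow short bursts at the cost of briefly exceeding the even split.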
Memory Pressure (sanitization workload):
- Problem: In-memory sanitization requires 100-500MB per state
- Solution: Worker pods with 4GB memory, queue-based distribution, streaming JSON parsing
- Target: < 2GB memory per worker at 80% utilization
14. Implementation Roadmap
Phase 1: Foundation (Weeks 1-4)
- Set up multi-tenant PostgreSQL database with RLS
- Implement Terraform Cloud API client with rate limiting
- Build basic state sanitization engine
- Deploy backend plugin to GKE (single tenant PoC)
Phase 2: Multi-Tenant Core (Weeks 5-8)
- Implement tenant context middleware
- Build API key authentication system
- Add per-tenant sanitization rules
- Deploy Pub/Sub message queue
- Implement catalog entity processor
Phase 3: Dynamic Discovery (Weeks 9-12)
- Build GitHub repository scanner
- Implement Terraform Cloud workspace enumeration
- Add webhook event handling
- Deploy automated onboarding workflow
Phase 4: Frontend & Polish (Weeks 13-16)
- Build React UI components
- Add Terraform workspace detail cards
- Implement admin dashboard
- Write end-to-end tests
Phase 5: Production Readiness (Weeks 17-20)
- Load testing (10K concurrent users)
- Security audit (SOC 2 prep)
- Performance optimization
- Documentation & runbooks
15. Conclusion
This architecture provides an enterprise-grade foundation for a multi-tenant Backstage plugin that integrates with Terraform Cloud at scale. Key design principles:
- Security First: Zero trust, encryption everywhere, tenant isolation
- Scalability: Horizontal scaling, queue-based architecture, database partitioning
- Real-Time: Webhook-driven updates, sub-5-minute latency
- Cost-Effective: Shared infrastructure, $40/client/month at 100 clients
- Compliance: SOC 2 Type II ready, comprehensive audit logging
Success Metrics:
- 100+ enterprise clients supported
- 99.9% uptime SLA
- < 5 minute catalog sync latency
- < 200ms API response time (p95)
- Zero cross-tenant data leaks
Next Steps:
- Review with stakeholders (security, infrastructure, product)
- Finalize technology stack and vendor selection
- Begin Phase 1 implementation (foundation)
- Set up continuous integration and deployment pipelines
Appendix A: Glossary
- RLS: Row-Level Security (PostgreSQL feature)
- CMEK: Customer-Managed Encryption Keys
- HPA: Horizontal Pod Autoscaler (Kubernetes)
- TFC: Terraform Cloud
- RBAC: Role-Based Access Control
- SOC 2: Service Organization Control 2 (compliance standard)
Appendix B: References
- Backstage Architecture Overview
- Terraform Cloud API Documentation
- PostgreSQL Row-Level Security
- Google Cloud Pub/Sub
- SOC 2 Compliance Guide
Document Control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-11-13 | System Architect Agent | Initial architecture design |