Multi-Tenant Database Architecture for Backstage Entities

Executive Summary

Recommendation: Row-Level Security (RLS) with PostgreSQL for 100-500 clients, transitioning to Horizontal Sharding at 500+ clients.

Key Metrics

Target Performance: Entity lookup < 50ms (50% faster than requirement)
Capacity: 100 clients × 10,000 entities = 1M entities in single DB
Isolation: Complete tenant isolation via RLS + tenant_id partitioning
Scalability: Horizontal sharding ready at 500+ clients

1. Multi-Tenancy Strategy Analysis

Option A: Schema-per-Tenant ❌

-- Separate schema for each client
CREATE SCHEMA client_acme;
CREATE SCHEMA client_widgets_inc;

CREATE TABLE client_acme.entities (
    entity_id UUID PRIMARY KEY,
    entity_ref TEXT NOT NULL,
    entity_json JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Pros:

Strong logical isolation
Simple tenant-level backups (pg_dump --schema=client_acme)
Independent schema evolution per tenant
Easy to drop entire tenant

Cons:

❌ Connection pooling nightmare: Each query must SET search_path = client_acme
❌ 1000+ schemas = PostgreSQL bloat: System catalog overhead
❌ Cross-tenant analytics impossible: No global queries
❌ Schema migrations: Must run 100+ times for each client
❌ Limited to ~1000 schemas in production

Verdict: Only viable for < 50 clients with independent deployments.

Option B: Row-Level Security (RLS) ✅ RECOMMENDED

-- Single unified schema with tenant isolation
CREATE TABLE entities (
    tenant_id UUID NOT NULL,
    entity_id UUID NOT NULL,
    entity_ref TEXT NOT NULL,
    entity_json JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (tenant_id, entity_id)
) PARTITION BY LIST (tenant_id);

-- Enable Row-Level Security
ALTER TABLE entities ENABLE ROW LEVEL SECURITY;

-- Policy: Users can only see their tenant's data
CREATE POLICY tenant_isolation ON entities
    FOR ALL
    USING (tenant_id = current_setting('app.tenant_id')::UUID);

-- Bypass RLS for admin operations
GRANT BYPASS RLS ON entities TO admin_role;

Pros:

✅ Native PostgreSQL security: RLS enforced at kernel level
✅ Single schema migrations: Deploy once for all tenants
✅ Connection pooling friendly: Set app.tenant_id per session
✅ Cross-tenant analytics: Admin role can query all data
✅ Proven at scale: GitHub, Heroku use RLS for multi-tenancy
✅ Horizontal scaling: Partition by tenant_id for sharding

Cons:

⚠️ Must set app.tenant_id for every connection (mitigated by middleware)
⚠️ RLS adds ~2-5ms overhead per query (negligible for our use case)

Verdict: Best balance of security, performance, and scalability for 100-500 clients.

Option C: Database-per-Tenant ❌

-- Separate database per client
CREATE DATABASE backstage_client_acme;
CREATE DATABASE backstage_client_widgets_inc;

\c backstage_client_acme
CREATE TABLE entities (...);

Pros:

Ultimate isolation (separate PostgreSQL process)
Tenant-level resource limits (CPU, memory, connections)
Independent backups and disaster recovery

Cons:

❌ Connection pool explosion: 100 clients × 10 connections = 1000+ connections
❌ Operational nightmare: 100 databases to monitor, backup, upgrade
❌ Cross-tenant queries impossible: No global search or analytics
❌ Cost: 100 databases = 100× PostgreSQL instances in managed services
❌ Schema migrations: Must run 100× (with zero-downtime complexity)

Verdict: Only for SaaS with < 10 enterprise clients requiring dedicated infrastructure.

2. Backstage Catalog Schema Extension

Original Backstage Schema (Single-Tenant)

-- Backstage's default catalog schema
CREATE TABLE entities (
    entity_id TEXT PRIMARY KEY,
    final_entity JSONB,
    hash TEXT NOT NULL,
    stitch_ticket TEXT
);

CREATE TABLE relations (
    originating_entity_id TEXT NOT NULL,
    source_entity_ref TEXT NOT NULL,
    target_entity_ref TEXT NOT NULL,
    type TEXT NOT NULL
);

CREATE TABLE refresh_state (
    entity_id TEXT PRIMARY KEY,
    entity_ref TEXT NOT NULL,
    unprocessed_entity JSONB,
    processed_entity JSONB,
    cache JSONB,
    errors TEXT
);

Multi-Tenant Extension (RLS-Based)

-- Core entities table with tenant isolation
CREATE TABLE entities (
    tenant_id UUID NOT NULL,
    entity_id UUID NOT NULL,
    entity_ref TEXT NOT NULL, -- e.g., "component:default/my-service"
    entity_kind TEXT NOT NULL, -- component, api, system, resource
    entity_namespace TEXT NOT NULL DEFAULT 'default',
    entity_name TEXT NOT NULL,

    -- Backstage fields
    final_entity JSONB NOT NULL, -- Complete entity YAML as JSON
    hash TEXT NOT NULL, -- Entity content hash for change detection
    stitch_ticket TEXT, -- For entity stitching during refresh

    -- Metadata
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    deleted_at TIMESTAMPTZ, -- Soft delete

    PRIMARY KEY (tenant_id, entity_id),
    UNIQUE (tenant_id, entity_ref)
) PARTITION BY LIST (tenant_id);

-- Entity relationships (multi-tenant)
CREATE TABLE relations (
    tenant_id UUID NOT NULL,
    relation_id UUID NOT NULL DEFAULT gen_random_uuid(),

    originating_entity_id UUID NOT NULL,
    source_entity_ref TEXT NOT NULL,
    target_entity_ref TEXT NOT NULL,
    type TEXT NOT NULL, -- ownedBy, dependsOn, apiProvidedBy, etc.

    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

    PRIMARY KEY (tenant_id, relation_id),
    FOREIGN KEY (tenant_id, originating_entity_id)
        REFERENCES entities(tenant_id, entity_id) ON DELETE CASCADE
) PARTITION BY LIST (tenant_id);

-- Refresh state for catalog ingestion
CREATE TABLE refresh_state (
    tenant_id UUID NOT NULL,
    entity_id UUID NOT NULL,
    entity_ref TEXT NOT NULL,

    unprocessed_entity JSONB, -- Raw YAML from source
    processed_entity JSONB, -- After validation
    cache JSONB, -- Location cache
    errors TEXT, -- Ingestion errors

    last_refresh TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    next_refresh TIMESTAMPTZ,

    PRIMARY KEY (tenant_id, entity_id),
    FOREIGN KEY (tenant_id, entity_id)
        REFERENCES entities(tenant_id, entity_id) ON DELETE CASCADE
) PARTITION BY LIST (tenant_id);

-- Row-Level Security Policies
ALTER TABLE entities ENABLE ROW LEVEL SECURITY;
ALTER TABLE relations ENABLE ROW LEVEL SECURITY;
ALTER TABLE refresh_state ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation_entities ON entities
    FOR ALL
    USING (tenant_id = current_setting('app.tenant_id')::UUID);

CREATE POLICY tenant_isolation_relations ON relations
    FOR ALL
    USING (tenant_id = current_setting('app.tenant_id')::UUID);

CREATE POLICY tenant_isolation_refresh ON refresh_state
    FOR ALL
    USING (tenant_id = current_setting('app.tenant_id')::UUID);

3. Entity ID Strategy

Approach: UUID-based with Tenant Column + Backstage-Compatible Refs

-- Entity storage
INSERT INTO entities (
    tenant_id,
    entity_id,
    entity_ref,
    entity_kind,
    entity_namespace,
    entity_name,
    final_entity,
    hash
) VALUES (
    '550e8400-e29b-41d4-a716-446655440000', -- tenant_id (UUID)
    'c73b9df0-6c42-4f3e-9d1a-5b2e8f3a7d9b', -- entity_id (UUID)
    'component:default/my-service', -- Backstage entity ref
    'component',
    'default',
    'my-service',
    '{"apiVersion": "backstage.io/v1alpha1", ...}'::jsonb,
    'abc123hash'
);

Why This Strategy?

UUID entity_id:
- Globally unique, no collisions
- Efficient indexing (16 bytes vs TEXT)
- Native PostgreSQL support
Backstage-compatible entity_ref:
- Format: {kind}:{namespace}/{name}
- Example: component:default/my-service
- Preserves Backstage semantics
- No tenant prefix in refs (cleaner UI)
Composite Primary Key (tenant_id, entity_id):
- Enforces uniqueness per tenant
- Enables partition pruning
- Prevents cross-tenant ID leaks
Namespace Collision Prevention:
- Tenant ID in partition key
- UNIQUE constraint on (tenant_id, entity_ref)
- Application-level validation

4. Performance Optimization

Index Strategy

-- Primary indexes (automatic from PRIMARY KEY)
CREATE INDEX idx_entities_pk ON entities (tenant_id, entity_id);

-- Entity lookups by ref (most common query)
CREATE INDEX idx_entities_ref ON entities (tenant_id, entity_ref)
    WHERE deleted_at IS NULL;

-- Entity kind filtering (list all APIs, components, etc.)
CREATE INDEX idx_entities_kind ON entities (tenant_id, entity_kind, created_at DESC)
    WHERE deleted_at IS NULL;

-- Full-text search on entity names
CREATE INDEX idx_entities_name_trgm ON entities
    USING gin (entity_name gin_trgm_ops)
    WHERE deleted_at IS NULL;

-- JSONB search (GIN index)
CREATE INDEX idx_entities_json ON entities
    USING gin (final_entity jsonb_path_ops);

-- Relations lookup
CREATE INDEX idx_relations_source ON relations (tenant_id, source_entity_ref);
CREATE INDEX idx_relations_target ON relations (tenant_id, target_entity_ref);
CREATE INDEX idx_relations_type ON relations (tenant_id, type);

-- Refresh state queries
CREATE INDEX idx_refresh_next ON refresh_state (tenant_id, next_refresh)
    WHERE next_refresh IS NOT NULL;

Partitioning Strategy

-- Create partitions for top 10 tenants (hot data)
CREATE TABLE entities_tenant_acme PARTITION OF entities
    FOR VALUES IN ('550e8400-e29b-41d4-a716-446655440000');

CREATE TABLE entities_tenant_widgets PARTITION OF entities
    FOR VALUES IN ('660e8400-e29b-41d4-a716-446655440001');

-- Default partition for new tenants
CREATE TABLE entities_default PARTITION OF entities DEFAULT;

-- Automatically create partitions on tenant creation
CREATE OR REPLACE FUNCTION create_tenant_partition()
RETURNS TRIGGER AS $$
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS entities_tenant_%s PARTITION OF entities FOR VALUES IN (%L)',
        NEW.tenant_id,
        NEW.tenant_id
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tenant_partition_trigger
    AFTER INSERT ON tenants
    FOR EACH ROW
    EXECUTE FUNCTION create_tenant_partition();

Query Patterns & Performance

-- Query 1: Get entity by ref (most common) - Target: < 20ms
SET app.tenant_id = '550e8400-e29b-41d4-a716-446655440000';
SELECT final_entity
FROM entities
WHERE entity_ref = 'component:default/my-service'
  AND deleted_at IS NULL;
-- Index used: idx_entities_ref
-- Partition pruning: YES
-- Expected: 5-15ms

-- Query 2: List all components - Target: < 50ms
SELECT entity_ref, entity_name, created_at
FROM entities
WHERE entity_kind = 'component'
  AND deleted_at IS NULL
ORDER BY created_at DESC
LIMIT 100;
-- Index used: idx_entities_kind
-- Expected: 20-40ms

-- Query 3: Search entities by name - Target: < 100ms
SELECT entity_ref, entity_name
FROM entities
WHERE entity_name ILIKE '%auth%'
  AND deleted_at IS NULL
LIMIT 50;
-- Index used: idx_entities_name_trgm
-- Expected: 50-100ms

-- Query 4: Get entity with all relations - Target: < 100ms
WITH entity_data AS (
    SELECT * FROM entities WHERE entity_ref = 'component:default/my-service'
)
SELECT
    e.*,
    json_agg(r.*) AS relations
FROM entity_data e
LEFT JOIN relations r ON r.source_entity_ref = e.entity_ref
GROUP BY e.entity_id;
-- Indexes used: idx_entities_ref, idx_relations_source
-- Expected: 30-80ms

Caching Layer

-- Redis cache strategy
Key: tenant:{tenant_id}:entity:{entity_ref}
TTL: 300 seconds (5 minutes)
Invalidation: On entity UPDATE/DELETE

-- Application-level caching
class EntityCache {
    async getEntity(tenantId, entityRef) {
        const cacheKey = `tenant:${tenantId}:entity:${entityRef}`;

        // Try Redis first
        let entity = await redis.get(cacheKey);
        if (entity) return JSON.parse(entity);

        // Cache miss: Query database
        entity = await db.query(
            'SELECT final_entity FROM entities WHERE tenant_id = $1 AND entity_ref = $2',
            [tenantId, entityRef]
        );

        // Cache for 5 minutes
        await redis.setex(cacheKey, 300, JSON.stringify(entity));
        return entity;
    }
}

Connection Pooling

// Per-tenant connection pooling with tenant_id session variable
const { Pool } = require('pg');

class TenantPool {
    constructor() {
        this.pool = new Pool({
            host: 'localhost',
            database: 'backstage_multi_tenant',
            max: 100, // Total connections
            idleTimeoutMillis: 30000,
            connectionTimeoutMillis: 2000,
        });
    }

    async withTenant(tenantId, callback) {
        const client = await this.pool.connect();
        try {
            // Set tenant context for RLS
            await client.query('SET app.tenant_id = $1', [tenantId]);
            return await callback(client);
        } finally {
            // Reset context before returning to pool
            await client.query('RESET app.tenant_id');
            client.release();
        }
    }
}

// Usage
const pool = new TenantPool();
const entities = await pool.withTenant(tenantId, async (client) => {
    return client.query('SELECT * FROM entities WHERE entity_kind = $1', ['component']);
});

5. Scalability Projections

Scenario 1: 10 Clients, 1K Entities Each (10K Total)

Database: Single PostgreSQL instance
Hardware: 4 vCPUs, 16GB RAM, 100GB SSD
Performance:
- Entity lookup: 5-10ms
- List queries: 10-20ms
- Search: 30-50ms
Cost: ~$100/month (AWS RDS db.m5.large)

Scenario 2: 100 Clients, 10K Entities Each (1M Total)

Database: Single PostgreSQL with read replicas
Hardware: 8 vCPUs, 32GB RAM, 500GB SSD
Partitioning: 100 tenant partitions + default
Performance:
- Entity lookup: 10-20ms (partition pruning helps)
- List queries: 30-60ms
- Search: 60-120ms
Read Replicas: 2× replicas for read scaling
Caching: Redis cluster (3 nodes, 8GB each)
Cost: ~$800/month (Primary + 2 replicas + Redis)

Scenario 3: 1000 Clients, 100K Entities Each (100M Total)

Database: Horizontal sharding required
Sharding Strategy: Hash-based on tenant_id
- Shard 1: Tenants 1-250 (25M entities)
- Shard 2: Tenants 251-500 (25M entities)
- Shard 3: Tenants 501-750 (25M entities)
- Shard 4: Tenants 751-1000 (25M entities)
Hardware per shard: 16 vCPUs, 64GB RAM, 1TB SSD
Routing: Application-level shard router
Performance:
- Entity lookup: 20-40ms (shard routing + query)
- Cross-shard queries: Not supported (by design)
Alternative: CockroachDB (auto-sharding, no app changes)
Cost: ~$4,000/month (4 shards + orchestration)

Storage Estimation

Average entity size: 5KB (JSON)
Relations per entity: 5 (avg)
Relation size: 200 bytes

Per entity storage:
- Entity: 5KB
- Relations: 5 × 200 bytes = 1KB
- Refresh state: 2KB
- Indexes: 2KB (estimated overhead)
Total: ~10KB per entity

Scalability:
- 10K entities: 100MB
- 100K entities: 1GB
- 1M entities: 10GB
- 10M entities: 100GB
- 100M entities: 1TB (with indexes)

Read/Write Patterns

Typical workload:
- 90% reads (entity lookups, searches)
- 10% writes (catalog refreshes)

Peak load (100 clients):
- 1000 reads/sec
- 100 writes/sec
- Burst capacity: 5000 reads/sec during catalog refresh

Database sizing:
- Single instance handles: 10K QPS (with caching)
- Read replicas needed: At 2K+ sustained QPS
- Sharding needed: At 10K+ sustained QPS

6. Data Lifecycle Management

Entity TTL (Time-to-Live)

-- Stale entity detection (not refreshed in 30 days)
CREATE INDEX idx_entities_stale ON entities (tenant_id, updated_at)
    WHERE deleted_at IS NULL;

-- Automated cleanup job (daily)
CREATE OR REPLACE FUNCTION cleanup_stale_entities()
RETURNS void AS $$
BEGIN
    -- Soft delete entities not updated in 90 days
    UPDATE entities
    SET deleted_at = NOW()
    WHERE updated_at < NOW() - INTERVAL '90 days'
      AND deleted_at IS NULL;

    -- Hard delete entities soft-deleted for 365 days
    DELETE FROM entities
    WHERE deleted_at < NOW() - INTERVAL '365 days';
END;
$$ LANGUAGE plpgsql;

-- Schedule with pg_cron
SELECT cron.schedule('cleanup-stale-entities', '0 2 * * *', 'SELECT cleanup_stale_entities()');

Soft Deletes

-- Soft delete implementation
UPDATE entities
SET deleted_at = NOW()
WHERE tenant_id = '550e8400-e29b-41d4-a716-446655440000'
  AND entity_ref = 'component:default/deprecated-service';

-- Restore soft-deleted entity
UPDATE entities
SET deleted_at = NULL
WHERE tenant_id = '550e8400-e29b-41d4-a716-446655440000'
  AND entity_ref = 'component:default/deprecated-service';

-- All queries filter out soft-deleted entities
SELECT * FROM entities
WHERE tenant_id = current_setting('app.tenant_id')::UUID
  AND deleted_at IS NULL;

Entity Change History

-- Audit log for entity changes
CREATE TABLE entity_history (
    history_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    entity_id UUID NOT NULL,
    entity_ref TEXT NOT NULL,

    operation TEXT NOT NULL, -- INSERT, UPDATE, DELETE
    old_entity JSONB,
    new_entity JSONB,
    changed_by TEXT, -- User or service account
    changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

    FOREIGN KEY (tenant_id, entity_id)
        REFERENCES entities(tenant_id, entity_id) ON DELETE CASCADE
) PARTITION BY RANGE (changed_at);

-- Partition by month for efficient querying
CREATE TABLE entity_history_2024_01 PARTITION OF entity_history
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Trigger to capture changes
CREATE OR REPLACE FUNCTION log_entity_changes()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO entity_history (tenant_id, entity_id, entity_ref, operation, new_entity, changed_by)
        VALUES (NEW.tenant_id, NEW.entity_id, NEW.entity_ref, 'INSERT', NEW.final_entity, current_user);
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO entity_history (tenant_id, entity_id, entity_ref, operation, old_entity, new_entity, changed_by)
        VALUES (NEW.tenant_id, NEW.entity_id, NEW.entity_ref, 'UPDATE', OLD.final_entity, NEW.final_entity, current_user);
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO entity_history (tenant_id, entity_id, entity_ref, operation, old_entity, changed_by)
        VALUES (OLD.tenant_id, OLD.entity_id, OLD.entity_ref, 'DELETE', OLD.final_entity, current_user);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER entity_changes_trigger
    AFTER INSERT OR UPDATE OR DELETE ON entities
    FOR EACH ROW
    EXECUTE FUNCTION log_entity_changes();

Backup and Restore

-- Per-tenant backup (logical backup)
pg_dump \
    --host=localhost \
    --dbname=backstage_multi_tenant \
    --table=entities \
    --where="tenant_id = '550e8400-e29b-41d4-a716-446655440000'" \
    --format=custom \
    --file=tenant_acme_backup_2024_01_15.dump

-- Per-tenant restore
pg_restore \
    --host=localhost \
    --dbname=backstage_multi_tenant \
    --table=entities \
    --data-only \
    tenant_acme_backup_2024_01_15.dump

-- Global backup (all tenants)
pg_dump \
    --host=localhost \
    --dbname=backstage_multi_tenant \
    --format=custom \
    --file=full_backup_2024_01_15.dump

-- Continuous archiving (PITR - Point-in-Time Recovery)
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'cp %p /mnt/backups/wal/%f'

# Daily base backup
pg_basebackup -D /mnt/backups/base/$(date +%Y%m%d) -Ft -z -Xs

# Restore to specific timestamp
pg_restore --target-time='2024-01-15 14:30:00' /mnt/backups/base/20240115

7. Technology Trade-offs Analysis

PostgreSQL (Recommended) ✅

Why PostgreSQL?

Native Backstage support (default database)
Row-Level Security (RLS) for tenant isolation
JSONB for entity storage (fast queries)
Mature replication (streaming, logical)
Rich indexing (GIN, GiST, BRIN, B-tree)
Proven at scale (Notion, Instagram, Reddit)

Limits:

Single-node write bottleneck (mitigated by read replicas)
Manual sharding required at 100M+ entities
Connection limit (~500 default, tune to 1000)

Best For: 10-500 clients, 1M-50M entities

MySQL (Alternative)

Why MySQL?

Slightly faster for simple key-value lookups
Galera Cluster for multi-master writes
InnoDB supports row-level locking

Why Not?

❌ No native RLS (must implement in app layer)
❌ Weaker JSON support (JSONB in Postgres is superior)
❌ No LISTEN/NOTIFY for real-time updates
❌ Backstage defaults to Postgres

Verdict: Only if team expertise strongly favors MySQL.

CockroachDB (Distributed, 1000+ Clients)

Why CockroachDB?

Automatic horizontal sharding (no app changes)
Built-in multi-region replication
PostgreSQL-compatible (Backstage works)
Survives node failures (no downtime)

Trade-offs:

2-3× higher latency than Postgres (cross-node coordination)
Higher cost (~5× Postgres for same throughput)
Learning curve (new consistency model)

Best For: 1000+ clients, 100M+ entities, multi-region

Google Cloud Spanner (Enterprise, Global Scale)

Why Spanner?

Global consistency (TrueTime)
99.999% SLA (5 minutes downtime/year)
Automatic sharding and replication
Managed service (zero ops)

Trade-offs:

10-50ms latency (global coordination)
Very high cost ($1,000+/month minimum)
Not PostgreSQL-compatible (schema changes needed)

Best For: Global enterprises, regulated industries, unlimited budget

8. Migration Path from Single-Tenant Backstage

Phase 1: Add Tenant ID (Zero Downtime)

-- Step 1: Add tenant_id column (nullable initially)
ALTER TABLE entities ADD COLUMN tenant_id UUID;
ALTER TABLE relations ADD COLUMN tenant_id UUID;
ALTER TABLE refresh_state ADD COLUMN tenant_id UUID;

-- Step 2: Backfill tenant_id for existing data (single tenant)
UPDATE entities SET tenant_id = '00000000-0000-0000-0000-000000000001';
UPDATE relations SET tenant_id = '00000000-0000-0000-0000-000000000001';
UPDATE refresh_state SET tenant_id = '00000000-0000-0000-0000-000000000001';

-- Step 3: Make tenant_id NOT NULL
ALTER TABLE entities ALTER COLUMN tenant_id SET NOT NULL;
ALTER TABLE relations ALTER COLUMN tenant_id SET NOT NULL;
ALTER TABLE refresh_state ALTER COLUMN tenant_id SET NOT NULL;

Phase 2: Update Primary Keys

-- Step 4: Drop old primary key and create composite key
ALTER TABLE entities DROP CONSTRAINT entities_pkey;
ALTER TABLE entities ADD PRIMARY KEY (tenant_id, entity_id);

-- Step 5: Add indexes
CREATE INDEX idx_entities_ref ON entities (tenant_id, entity_ref);
CREATE INDEX idx_entities_kind ON entities (tenant_id, entity_kind);

Phase 3: Enable RLS

-- Step 6: Enable Row-Level Security
ALTER TABLE entities ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON entities
    FOR ALL
    USING (tenant_id = current_setting('app.tenant_id')::UUID);

-- Step 7: Grant bypass to admin
GRANT BYPASS RLS ON entities TO backstage_admin;

Phase 4: Partition by Tenant

-- Step 8: Create new partitioned table
CREATE TABLE entities_new (LIKE entities INCLUDING ALL)
    PARTITION BY LIST (tenant_id);

-- Step 9: Migrate data
INSERT INTO entities_new SELECT * FROM entities;

-- Step 10: Swap tables (requires maintenance window)
BEGIN;
ALTER TABLE entities RENAME TO entities_old;
ALTER TABLE entities_new RENAME TO entities;
COMMIT;

9. Complete DDL Schema

See /docs/database/schema.sql for full implementation.

10. Recommendations Summary

For 100 Clients (Current Requirement)

Use PostgreSQL with Row-Level Security
Partition by tenant_id (100 partitions + default)
Deploy read replicas (2×) for read scaling
Add Redis caching (5-minute TTL)
Index strategy: entity_ref, entity_kind, JSONB paths
Monitor query performance: Target < 50ms for entity lookups

For 500+ Clients (Future Scaling)

Horizontal sharding by tenant_id hash
CockroachDB migration for auto-sharding
Multi-region replication if global

Security Checklist

✅ RLS policies on all tables
✅ Tenant ID validation in middleware
✅ Audit logging of entity changes
✅ Encrypted connections (SSL/TLS)
✅ Backup encryption at rest

Performance Targets

Entity lookup by ref: < 20ms (90th percentile)
List entities by kind: < 50ms (90th percentile)
Full-text search: < 100ms (90th percentile)
Catalog refresh (1000 entities): < 30 seconds

Appendix: Query Performance Testing

-- Performance testing script
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Run 1000 entity lookups
DO $$
DECLARE
    tenant UUID := '550e8400-e29b-41d4-a716-446655440000';
    i INT;
BEGIN
    SET app.tenant_id = tenant;

    FOR i IN 1..1000 LOOP
        PERFORM * FROM entities
        WHERE entity_ref = 'component:default/service-' || i::TEXT
          AND deleted_at IS NULL;
    END LOOP;
END $$;

-- Check performance
SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    query
FROM pg_stat_statements
WHERE query LIKE '%entities%'
ORDER BY mean_exec_time DESC;

Executive Summary​

Key Metrics​

1. Multi-Tenancy Strategy Analysis​

Option A: Schema-per-Tenant ❌​

Option B: Row-Level Security (RLS) ✅ RECOMMENDED​

Option C: Database-per-Tenant ❌​

2. Backstage Catalog Schema Extension​

Original Backstage Schema (Single-Tenant)​

Multi-Tenant Extension (RLS-Based)​

3. Entity ID Strategy​

Approach: UUID-based with Tenant Column + Backstage-Compatible Refs​

Why This Strategy?​

4. Performance Optimization​

Index Strategy​

Partitioning Strategy​

Query Patterns & Performance​

Caching Layer​

Connection Pooling​

5. Scalability Projections​

Scenario 1: 10 Clients, 1K Entities Each (10K Total)​

Scenario 2: 100 Clients, 10K Entities Each (1M Total)​

Scenario 3: 1000 Clients, 100K Entities Each (100M Total)​

Storage Estimation​

Read/Write Patterns​

6. Data Lifecycle Management​

Entity TTL (Time-to-Live)​

Soft Deletes​

Entity Change History​

Backup and Restore​

7. Technology Trade-offs Analysis​

PostgreSQL (Recommended) ✅​

MySQL (Alternative)​

CockroachDB (Distributed, 1000+ Clients)​

Google Cloud Spanner (Enterprise, Global Scale)​

8. Migration Path from Single-Tenant Backstage​

Phase 1: Add Tenant ID (Zero Downtime)​

Phase 2: Update Primary Keys​

Phase 3: Enable RLS​

Phase 4: Partition by Tenant​

9. Complete DDL Schema​

10. Recommendations Summary​

For 100 Clients (Current Requirement)​

For 500+ Clients (Future Scaling)​

Security Checklist​

Performance Targets​

Appendix: Query Performance Testing​

Executive Summary

Key Metrics

1. Multi-Tenancy Strategy Analysis

Option A: Schema-per-Tenant ❌

Option B: Row-Level Security (RLS) ✅ RECOMMENDED

Option C: Database-per-Tenant ❌

2. Backstage Catalog Schema Extension

Original Backstage Schema (Single-Tenant)

Multi-Tenant Extension (RLS-Based)

3. Entity ID Strategy

Approach: UUID-based with Tenant Column + Backstage-Compatible Refs

Why This Strategy?

4. Performance Optimization

Index Strategy

Partitioning Strategy

Query Patterns & Performance

Caching Layer

Connection Pooling

5. Scalability Projections

Scenario 1: 10 Clients, 1K Entities Each (10K Total)

Scenario 2: 100 Clients, 10K Entities Each (1M Total)

Scenario 3: 1000 Clients, 100K Entities Each (100M Total)

Storage Estimation

Read/Write Patterns

6. Data Lifecycle Management

Entity TTL (Time-to-Live)

Soft Deletes

Entity Change History

Backup and Restore

7. Technology Trade-offs Analysis

PostgreSQL (Recommended) ✅

MySQL (Alternative)

CockroachDB (Distributed, 1000+ Clients)

Google Cloud Spanner (Enterprise, Global Scale)

8. Migration Path from Single-Tenant Backstage

Phase 1: Add Tenant ID (Zero Downtime)

Phase 2: Update Primary Keys

Phase 3: Enable RLS

Phase 4: Partition by Tenant

9. Complete DDL Schema

10. Recommendations Summary

For 100 Clients (Current Requirement)

For 500+ Clients (Future Scaling)

Security Checklist

Performance Targets

Appendix: Query Performance Testing