Onboarding Workflow State Machine

State Machine Overview

The onboarding workflow is implemented as a finite state machine with clear transitions, rollback paths, and retry logic.

          START
            │
            ▼
┌─────────────────────────┐
│ RECEIVED                │
│ • Event captured        │
│ • Assigned correlation  │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ VALIDATING              │──────┐
│ • Tenant check          │      │
│ • BU detection          │      │ INVALID
│ • Duplicate check       │      │
└───────────┬─────────────┘      │
            │ VALID              │
            ▼                    ▼
┌─────────────────────────┐ ┌─────────┐
│ DISCOVERING             │ │ REJECTED│
│ • Find TFC workspace    │ └─────────┘
│ • Fetch repo metadata   │
│ • Download state        │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ EXTRACTING              │
│ • Parse metadata        │──────┐
│ • Sanitize state        │      │
│ • Extract resources     │      │ FAILED
└───────────┬─────────────┘      │
            │ SUCCESS            │
            ▼                    │
┌─────────────────────────┐      │
│ GENERATING              │      │
│ • Create Domain entity  │──────┤
│ • Create System entity  │      │
│ • Create Components     │      │
└───────────┬─────────────┘      │
            │                    │
            ▼                    │
┌─────────────────────────┐      │
│ REGISTERING             │      │
│ • Insert to catalog     │──────┤
│ • Update relationships  │      │
│ • Validate consistency  │      │
└───────────┬─────────────┘      │
            │                    │
            ▼                    │
┌─────────────────────────┐      │
│ CONFIGURING_SYNC        │      │
│ • Setup TFC webhooks    │──────┤
│ • Schedule sync jobs    │      │
│ • Configure watchers    │      │
└───────────┬─────────────┘      │
            │                    │
            ▼                    │
┌─────────────────────────┐      │
│ NOTIFYING               │      │
│ • Send notifications    │      │
│ • Update status         │      │
└───────────┬─────────────┘      │
            │                    │
            ▼                    ▼
┌─────────────────────────┐ ┌─────────┐
│ COMPLETED               │ │ FAILED  │──┐
└─────────────────────────┘ └─────────┘  │
                                         │
                                         ▼
                                ┌──────────────────┐
                                │ ROLLING_BACK     │
                                │ • Delete entities│
                                │ • Clean up hooks │
                                └──────────────────┘
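
The same workflow can be written down as a typed transition table, which makes illegal moves easy to reject. The sketch below is illustrative only: `TRANSITIONS`, `canTransition`, and `isFinal` are hypothetical names, and the edges mirror the `on` maps of the engine configuration in the Implementation section.

```typescript
type OnboardingState =
  | 'RECEIVED' | 'VALIDATING' | 'DISCOVERING' | 'EXTRACTING' | 'GENERATING'
  | 'REGISTERING' | 'CONFIGURING_SYNC' | 'NOTIFYING'
  | 'COMPLETED' | 'FAILED' | 'REJECTED' | 'ROLLING_BACK';

// Allowed target states for each source state (hypothetical helper table).
const TRANSITIONS: Record<OnboardingState, readonly OnboardingState[]> = {
  RECEIVED: ['VALIDATING'],
  VALIDATING: ['DISCOVERING', 'REJECTED'],
  DISCOVERING: ['EXTRACTING', 'FAILED'],
  EXTRACTING: ['GENERATING', 'FAILED'],
  GENERATING: ['REGISTERING', 'FAILED'],
  REGISTERING: ['CONFIGURING_SYNC', 'ROLLING_BACK'],
  CONFIGURING_SYNC: ['NOTIFYING', 'COMPLETED'],
  NOTIFYING: ['COMPLETED'],
  ROLLING_BACK: ['FAILED'],
  COMPLETED: [],
  FAILED: [],
  REJECTED: [],
};

// True when the table lists `to` as a direct successor of `from`.
function canTransition(from: OnboardingState, to: OnboardingState): boolean {
  return TRANSITIONS[from].includes(to);
}

// A state with no outgoing edges is final.
function isFinal(state: OnboardingState): boolean {
  return TRANSITIONS[state].length === 0;
}
```

Keeping the table separate from the handlers lets a persistence layer validate a stored state before resuming from it.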

State Definitions

RECEIVED

Entry: Event received from trigger mechanism

Actions:

  • Generate unique onboarding ID
  • Assign correlation ID
  • Persist event to database
  • Acknowledge event source

Transitions:

  • → VALIDATING (always)

Timeout: 1 second


VALIDATING

Purpose: Verify the event is valid and eligible for onboarding

Actions:

  1. Tenant Validation

    • Resolve tenant from GitHub org or TFC org
    • Verify tenant exists and is active
    • Check tenant has necessary permissions
  2. Business Unit Detection

    • Run detection algorithms (see 04-detection-algorithms.md)
    • Verify repository/workspace matches BU criteria
    • Extract BU identifier
  3. Duplicate Detection

    • Check if already onboarded (by repo URL)
    • Check if onboarding in progress
    • Verify not in failed state (unless force=true)
  4. Permission Check

    • Verify authenticated user can onboard for this tenant
    • Check rate limits not exceeded

Transitions:

  • → DISCOVERING (if valid)
  • → REJECTED (if invalid)

Timeout: 10 seconds

Retry: No retry (validation must pass on first attempt)
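
The duplicate-detection rules above can be sketched as a pure function. This is a minimal illustration, not the actual implementation: `PriorOnboarding`, `ValidationResult`, and `checkDuplicate` are hypothetical names, and a real check would query the onboarding history table instead of taking an in-memory array.

```typescript
type PriorStatus = 'completed' | 'in_progress' | 'failed';

interface PriorOnboarding {
  repoUrl: string;
  status: PriorStatus;
}

type ValidationResult = { valid: true } | { valid: false; reason: string };

// Applies the three duplicate rules in order; `force` only overrides the failed-state rule.
function checkDuplicate(
  repoUrl: string,
  history: PriorOnboarding[],
  force = false,
): ValidationResult {
  const prior = history.filter((h) => h.repoUrl === repoUrl);

  if (prior.some((h) => h.status === 'completed')) {
    return { valid: false, reason: 'already onboarded' };
  }
  if (prior.some((h) => h.status === 'in_progress')) {
    return { valid: false, reason: 'onboarding in progress' };
  }
  if (!force && prior.some((h) => h.status === 'failed')) {
    return { valid: false, reason: 'previous attempt failed (retry with force=true)' };
  }
  return { valid: true };
}
```

An invalid result maps to the INVALID event and the `reason` can be stored as the rejection reason.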


DISCOVERING

Purpose: Gather all necessary data from external systems

Actions:

  1. GitHub Data

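    const repoData = await github.repos.get({ owner, repo });
    const topics = await github.repos.getAllTopics({ owner, repo });
    // The Backstage config file is optional, so treat a 404 as "not present"
    const content = await github.repos
      .getContent({ owner, repo, path: '.backstage/config.yaml' })
      .catch((err) => {
        if (err.status === 404) return null;
        throw err;
      });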
  2. Terraform Cloud Data

    const workspace = await tfc.getWorkspace(workspaceId);
    const currentState = await tfc.getCurrentState(workspaceId);
    const runs = await tfc.listRuns(workspaceId, { limit: 5 });
    const outputs = await tfc.getWorkspaceOutputs(workspaceId);
  3. GCP Data (optional)

    const folder = await gcp.cloudresourcemanager.folders.get({ name: folderId });
    const projects = await gcp.cloudresourcemanager.projects.list({ parent: folderId });

Transitions:

  • → EXTRACTING (if all data retrieved)
  • → FAILED (if critical data missing)

Timeout: 60 seconds

Retry: Yes, exponential backoff (3 attempts)
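
A retry policy of this kind can be sketched as a small wrapper around any discovery call. The helper names (`backoffDelayMs`, `withRetry`) and the 500 ms base delay and 10 s cap are assumptions for illustration; the document only specifies three attempts with exponential backoff.

```typescript
// Delay before the next attempt: base * 2^(attempt - 1), capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 10_000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

// Runs `fn` up to `attempts` times, sleeping between failures.
// `sleep` is injectable so tests can skip real waiting.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt; the error is rethrown below.
      if (attempt < attempts) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

A call such as `withRetry(() => tfc.getWorkspace(workspaceId))` would then map a final rejection to the DISCOVERY_FAILED event.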


EXTRACTING

Purpose: Parse and transform raw data into structured metadata

Actions:

  1. Metadata Extraction

    const metadata = {
      businessUnit: extractBusinessUnit(repoName, workspace.tags),
      owner: extractOwner(repoData, backstageConfig),
      system: extractSystem(workspace.name),
      environment: extractEnvironment(workspace.name),
      gcpFolder: extractFolderId(terraformState),
      gcpProjects: extractProjects(terraformState),
      customMetadata: backstageConfig?.metadata || {},
    };
  2. State Sanitization

    const sanitized = sanitizeState(terraformState, {
      removeSecrets: true,
      includeOutputs: true,
      includeResources: ['google_folder', 'google_project', 'google_service'],
    });
  3. Resource Mapping

    const resources = terraformState.resources
      .filter(r => r.type.startsWith('google_'))
      .map(r => ({
        type: r.type,
        name: r.name,
        id: r.instances[0]?.attributes?.id,
        metadata: extractResourceMetadata(r),
      }));

Transitions:

  • → GENERATING (if extraction successful)
  • → FAILED (if required metadata missing)

Timeout: 30 seconds

Retry: Yes (3 attempts with exponential backoff)
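
To make the sanitization step concrete, here is one way a `sanitizeState` helper with the three options shown above might behave. It is a sketch under assumptions, not the project's implementation: the `Tf*` types model only the parts of a Terraform state file used here, and the `SECRET_KEY` pattern is a hypothetical heuristic that inspects top-level attribute names only.

```typescript
interface TfInstance { attributes?: Record<string, unknown> }
interface TfResource { type: string; name: string; instances: TfInstance[] }
interface TfOutput { value: unknown; sensitive?: boolean }
interface TfState { resources: TfResource[]; outputs?: Record<string, TfOutput> }

interface SanitizeOptions {
  removeSecrets: boolean;
  includeOutputs: boolean;
  includeResources: string[];
}

// Hypothetical heuristic: attribute names that look like credentials.
const SECRET_KEY = /(password|secret|token|private_key)/i;

function sanitizeState(state: TfState, opts: SanitizeOptions): TfState {
  // Keep only allow-listed resource types and drop secret-looking attributes.
  const resources: TfResource[] = state.resources
    .filter((r) => opts.includeResources.includes(r.type))
    .map((r) => ({
      ...r,
      instances: r.instances.map((i) => ({
        attributes: Object.fromEntries(
          Object.entries(i.attributes ?? {}).filter(
            ([key]) => !(opts.removeSecrets && SECRET_KEY.test(key)),
          ),
        ),
      })),
    }));

  if (!opts.includeOutputs) {
    return { resources };
  }

  // Outputs flagged as sensitive in the state file are removed as well.
  const outputs = Object.fromEntries(
    Object.entries(state.outputs ?? {}).filter(
      ([, output]) => !(opts.removeSecrets && output.sensitive === true),
    ),
  );
  return { resources, outputs };
}
```

An allow-list of resource types is used rather than a deny-list so that unknown resource types never reach the catalog by default.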


GENERATING

Purpose: Create Backstage entity YAML definitions

Actions:

  1. Domain Entity

    apiVersion: backstage.io/v1alpha1
    kind: Domain
    metadata:
      name: acme-finance
      namespace: tenant-123
      annotations:
        backstage.io/managed-by-location: url:https://github.com/acme-corp/bu-finance-infrastructure
        cloud.google.com/folder-id: "folders/123456"
      labels:
        tenant: tenant-123
        business-unit: finance
    spec:
      owner: group:finance-leadership
  2. System Entity

    apiVersion: backstage.io/v1alpha1
    kind: System
    metadata:
      name: acme-finance-infrastructure
      namespace: tenant-123
      annotations:
        backstage.io/managed-by-location: url:https://github.com/acme-corp/bu-finance-infrastructure
        terraform.io/workspace: bu-finance-infrastructure
        cloud.google.com/folder-id: "folders/123456"
    spec:
      owner: group:finance-engineering
      domain: acme-finance
  3. Component Entities (for each GCP project)

    apiVersion: backstage.io/v1alpha1
    kind: Component
    metadata:
      name: finance-seed-project
      namespace: tenant-123
      annotations:
        backstage.io/managed-by-location: url:https://github.com/acme-corp/bu-finance-infrastructure
        cloud.google.com/project-id: finance-seed-prod-a1b2
    spec:
      type: gcp-project
      lifecycle: production
      owner: group:finance-engineering
      system: acme-finance-infrastructure
      providesApis: []
      dependsOn: []

Transitions:

  • → REGISTERING (if generation successful)
  • → FAILED (if entity validation fails)

Timeout: 20 seconds

Retry: Yes (3 attempts)
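
The Component definition above could be produced by a small builder that also checks the entity name before anything is written. This is a sketch with hypothetical names (`GcpProject`, `EntityContext`, `buildComponentEntity`, `isValidEntityName`); the name rule follows Backstage's documented convention of alphanumeric runs separated by `-`, `_`, or `.`, at most 63 characters, and `lifecycle` is hard-coded to match the example.

```typescript
interface GcpProject { name: string; projectId: string }

interface EntityContext {
  tenantId: string;
  repoUrl: string;
  owner: string;
  system: string;
}

interface ComponentEntity {
  apiVersion: 'backstage.io/v1alpha1';
  kind: 'Component';
  metadata: { name: string; namespace: string; annotations: Record<string, string> };
  spec: { type: string; lifecycle: string; owner: string; system: string };
}

const ENTITY_NAME = /^[a-zA-Z0-9]+([-_.][a-zA-Z0-9]+)*$/;

function isValidEntityName(name: string): boolean {
  return name.length >= 1 && name.length <= 63 && ENTITY_NAME.test(name);
}

function buildComponentEntity(project: GcpProject, ctx: EntityContext): ComponentEntity {
  if (!isValidEntityName(project.name)) {
    // Surfaces as GENERATION_FAILED in the workflow.
    throw new Error(`Invalid entity name: ${project.name}`);
  }
  return {
    apiVersion: 'backstage.io/v1alpha1',
    kind: 'Component',
    metadata: {
      name: project.name,
      namespace: ctx.tenantId,
      annotations: {
        'backstage.io/managed-by-location': `url:${ctx.repoUrl}`,
        'cloud.google.com/project-id': project.projectId,
      },
    },
    spec: {
      type: 'gcp-project',
      lifecycle: 'production', // assumption: fixed here to mirror the example above
      owner: ctx.owner,
      system: ctx.system,
    },
  };
}
```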


REGISTERING

Purpose: Insert entities into Backstage catalog database

Actions:

  1. Transactional Insert

    await db.transaction(async (trx) => {
      // Assumes PostgreSQL with Knex 1.x or later, where returning('id') resolves to [{ id }]

      // Insert domain
      const [{ id: domainId }] = await trx('entities')
        .insert({
          api_version: 'backstage.io/v1alpha1',
          kind: 'Domain',
          metadata: JSON.stringify(domainEntity.metadata),
          spec: JSON.stringify(domainEntity.spec),
          tenant_id: tenantId,
        })
        .returning('id');

      // Insert system
      const [{ id: systemId }] = await trx('entities')
        .insert({
          api_version: 'backstage.io/v1alpha1',
          kind: 'System',
          metadata: JSON.stringify(systemEntity.metadata),
          spec: JSON.stringify(systemEntity.spec),
          tenant_id: tenantId,
        })
        .returning('id');

      // Insert components, keeping the generated IDs for the relationship rows
      const componentIds = [];
      for (const component of componentEntities) {
        const [{ id }] = await trx('entities')
          .insert({
            api_version: 'backstage.io/v1alpha1',
            kind: 'Component',
            metadata: JSON.stringify(component.metadata),
            spec: JSON.stringify(component.spec),
            tenant_id: tenantId,
          })
          .returning('id');
        componentIds.push(id);
      }

      // Update relationships
      await trx('entity_relationships').insert([
        { source_id: systemId, target_id: domainId, type: 'partOf' },
        ...componentIds.map(id => ({
          source_id: id,
          target_id: systemId,
          type: 'partOf',
        })),
      ]);
    });
  2. Validation

    // Verify entities are queryable
    const entities = await catalog.getEntities({
    filter: {
    'metadata.namespace': tenantId,
    'metadata.annotations.backstage.io/managed-by-location': repoUrl,
    },
    });

    if (entities.length !== expectedCount) {
    throw new Error('Entity count mismatch after registration');
    }

Transitions:

  • → CONFIGURING_SYNC (if registration successful)
  • → ROLLING_BACK (if the database insert fails; cleanup then ends in FAILED)

Timeout: 30 seconds

Retry: Yes, with exponential backoff (5 attempts)

Rollback: Delete all inserted entities on failure


CONFIGURING_SYNC

Purpose: Set up ongoing synchronization mechanisms

Actions:

  1. TFC Webhook

    await tfc.createNotification({
      workspaceId,
      name: 'backstage-sync',
      destinationType: 'generic',
      url: `https://backstage.example.com/api/sync/tfc-webhook`,
      token: await this.generateWebhookToken(workspaceId),
      triggers: ['run:completed'],
      enabled: true,
    });
  2. Scheduled Sync Job

    await scheduler.schedule({
      name: `sync-${workspaceId}`,
      schedule: '0 */6 * * *', // Every 6 hours
      job: {
        type: 'terraform-state-sync',
        workspaceId,
        tenantId,
        entities: entityIds,
      },
    });
  3. GitHub Watcher

    // Keep the hook ID so ROLLING_BACK can delete this webhook later
    const { data: hook } = await github.repos.createWebhook({
      owner,
      repo,
      config: {
        url: `https://backstage.example.com/api/sync/github-webhook`,
        content_type: 'json',
        secret: await this.generateWebhookSecret(repoUrl),
      },
      events: ['push'],
    });
    const webhookId = hook.id;

Transitions:

  • → NOTIFYING (if sync setup successful)
  • → COMPLETED (if notifications disabled)
  • → FAILED (if webhook creation fails)

Timeout: 30 seconds

Retry: Yes (3 attempts)

Rollback: Delete webhooks on failure
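
The `secret` configured on the GitHub webhook is only useful if the receiving endpoint checks it. GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The sketch below shows that check; `signPayload` and `verifyGithubSignature` are hypothetical helper names, and how the raw body and header are obtained depends on the HTTP framework in use.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Computes the header value GitHub would send for this body and secret.
function signPayload(secret: string, rawBody: string): string {
  return 'sha256=' + createHmac('sha256', secret).update(rawBody).digest('hex');
}

// Returns true only when the header matches the expected signature.
function verifyGithubSignature(
  secret: string,
  rawBody: string,
  signatureHeader: string | undefined,
): boolean {
  if (!signatureHeader) {
    return false;
  }
  const expected = Buffer.from(signPayload(secret, rawBody));
  const received = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The comparison uses `timingSafeEqual` rather than `===` so that response time does not leak how many leading characters of a forged signature were correct.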


NOTIFYING

Purpose: Notify stakeholders of successful onboarding

Actions:

  1. Slack Notification

    await slack.postMessage({
      channel: '#backstage-onboarding',
      text: `✅ Business Unit onboarded successfully!`,
      blocks: [
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `*Business Unit:* ${metadata.businessUnit}\n` +
              `*Owner:* ${metadata.owner}\n` +
              `*Repository:* <${repoUrl}|${repoName}>\n` +
              `*Workspace:* ${workspaceName}`,
          },
        },
        {
          type: 'actions',
          elements: [
            {
              type: 'button',
              text: { type: 'plain_text', text: 'View in Backstage' },
              url: `https://backstage.example.com/catalog/${tenantId}/${systemEntity.metadata.name}`,
            },
          ],
        },
      ],
    });
  2. Email Notification

    await email.send({
      to: metadata.owner,
      subject: `Backstage: ${metadata.businessUnit} onboarded`,
      template: 'onboarding-success',
      data: {
        businessUnit: metadata.businessUnit,
        backstageUrl: `https://backstage.example.com/catalog/${tenantId}/${systemEntity.metadata.name}`,
        repoUrl,
        workspaceName,
        entitiesCreated: entities.length,
      },
    });

Transitions:

  • → COMPLETED (always)

Timeout: 10 seconds

Retry: No (notification failures don't fail onboarding)
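
Because a notification failure must not fail the onboarding, the channels can be sent independently and their errors collected instead of thrown. A minimal sketch, assuming each channel is wrapped in a hypothetical `Notifier` object:

```typescript
interface Notifier {
  name: string;
  send: () => Promise<void>;
}

// Sends all notifications concurrently and resolves with the names of the
// channels that failed. It never rejects, so the caller can always emit NOTIFIED.
async function notifyAll(notifiers: Notifier[]): Promise<string[]> {
  const results = await Promise.allSettled(notifiers.map((n) => n.send()));
  return results.flatMap((result, i) =>
    result.status === 'rejected' ? [notifiers[i].name] : [],
  );
}
```

The returned names can be logged as warnings before the workflow moves on to COMPLETED.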


COMPLETED

Final State: Onboarding successfully completed

Actions:

  • Update onboarding record status to 'completed'
  • Record completion timestamp
  • Generate audit log entry

Persistence:

await db('onboarding_history').insert({
  id: onboardingId,
  tenant_id: tenantId,
  repo_url: repoUrl,
  workspace_id: workspaceId,
  status: 'completed',
  entities_created: entityIds,
  duration_ms: Date.now() - startTime,
  completed_at: new Date(),
});

FAILED

Final State: Onboarding failed

Actions:

  • Record failure reason
  • Trigger rollback if necessary
  • Send failure notification
  • Generate detailed error report

Persistence:

await db('onboarding_history').insert({
  id: onboardingId,
  tenant_id: tenantId,
  repo_url: repoUrl,
  status: 'failed',
  error: error.message,
  error_state: currentState,
  stack_trace: error.stack,
  failed_at: new Date(),
});

REJECTED

Final State: Event rejected during validation

Actions:

  • Record rejection reason
  • No rollback needed (no entities created)

Common Reasons:

  • Not a business unit repository
  • Duplicate onboarding attempt
  • Tenant not authorized
  • Invalid metadata

ROLLING_BACK

Cleanup State: Undo partial onboarding

Actions:

  1. Delete Entities

    await db.transaction(async (trx) => {
      // Delete in reverse order (relationships first)
      await trx('entity_relationships')
        .whereIn('source_id', entityIds)
        .orWhereIn('target_id', entityIds)
        .delete();

      await trx('entities')
        .whereIn('id', entityIds)
        .delete();
    });
  2. Remove Webhooks

    await tfc.deleteNotification(workspaceId, 'backstage-sync');
    // Octokit REST methods take a single parameters object
    await github.repos.deleteWebhook({ owner, repo, hook_id: webhookId });
  3. Cancel Scheduled Jobs

    await scheduler.cancel(`sync-${workspaceId}`);

Transitions:

  • → FAILED (after cleanup)
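
One common way to organize cleanup like this is a compensation log: each state records an undo step when it creates something, and the rollback replays those steps in reverse order of creation, continuing past individual failures. This differs from the fixed step order listed above and is offered as an alternative sketch; `Compensation` and `RollbackLog` are hypothetical names.

```typescript
interface Compensation {
  name: string;
  undo: () => Promise<void>;
}

class RollbackLog {
  private steps: Compensation[] = [];

  // Called by a state handler right after it creates an external side effect.
  record(step: Compensation): void {
    this.steps.push(step);
  }

  // Undoes recorded steps newest-first. A failing step is noted and skipped
  // so that the remaining cleanup still runs.
  async rollback(): Promise<{ undone: string[]; failed: string[] }> {
    const undone: string[] = [];
    const failed: string[] = [];
    for (const step of [...this.steps].reverse()) {
      try {
        await step.undo();
        undone.push(step.name);
      } catch {
        failed.push(step.name);
      }
    }
    return { undone, failed };
  }
}
```

Anything left in `failed` would need manual cleanup and is worth including in the error report written by the FAILED state.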

Implementation

State Machine Engine

// src/onboarding/workflows/state-machine.ts
import { StateMachine } from 'state-machine-lib';

export class OnboardingStateMachine {
  private machine: StateMachine<OnboardingState, OnboardingEvent>;

  constructor(
    private context: OnboardingContext,
    private handlers: StateHandlers
  ) {
    this.machine = new StateMachine({
      initial: 'RECEIVED',
      states: {
        RECEIVED: {
          on: { VALIDATE: 'VALIDATING' },
          timeout: 1000,
          handler: this.handlers.handleReceived,
        },
        VALIDATING: {
          on: {
            VALID: 'DISCOVERING',
            INVALID: 'REJECTED',
          },
          timeout: 10000,
          handler: this.handlers.handleValidating,
        },
        DISCOVERING: {
          on: {
            DISCOVERED: 'EXTRACTING',
            DISCOVERY_FAILED: 'FAILED',
          },
          timeout: 60000,
          retry: { attempts: 3, backoff: 'exponential' },
          handler: this.handlers.handleDiscovering,
        },
        EXTRACTING: {
          on: {
            EXTRACTED: 'GENERATING',
            EXTRACTION_FAILED: 'FAILED',
          },
          timeout: 30000,
          retry: { attempts: 3, backoff: 'exponential' },
          handler: this.handlers.handleExtracting,
        },
        GENERATING: {
          on: {
            GENERATED: 'REGISTERING',
            GENERATION_FAILED: 'FAILED',
          },
          timeout: 20000,
          retry: { attempts: 3, backoff: 'exponential' },
          handler: this.handlers.handleGenerating,
        },
        REGISTERING: {
          on: {
            REGISTERED: 'CONFIGURING_SYNC',
            REGISTRATION_FAILED: 'ROLLING_BACK',
          },
          timeout: 30000,
          retry: { attempts: 5, backoff: 'exponential' },
          handler: this.handlers.handleRegistering,
        },
        CONFIGURING_SYNC: {
          on: {
            SYNC_CONFIGURED: 'NOTIFYING',
            SYNC_FAILED: 'COMPLETED', // Non-critical, continue
          },
          timeout: 30000,
          retry: { attempts: 3, backoff: 'exponential' },
          handler: this.handlers.handleConfigureSync,
        },
        NOTIFYING: {
          on: { NOTIFIED: 'COMPLETED' },
          timeout: 10000,
          handler: this.handlers.handleNotifying,
        },
        COMPLETED: {
          type: 'final',
          handler: this.handlers.handleCompleted,
        },
        FAILED: {
          type: 'final',
          handler: this.handlers.handleFailed,
        },
        REJECTED: {
          type: 'final',
          handler: this.handlers.handleRejected,
        },
        ROLLING_BACK: {
          on: { ROLLED_BACK: 'FAILED' },
          timeout: 60000,
          handler: this.handlers.handleRollback,
        },
      },
    });
  }

  async run(): Promise<OnboardingResult> {
    try {
      await this.machine.start(this.context);
      return this.machine.getResult();
    } catch (error) {
      console.error('State machine execution failed:', error);
      throw error;
    }
  }
}
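
The per-state `timeout` values in the configuration imply that each handler is raced against a timer. How `state-machine-lib` does this is not shown here, so the following is only a sketch of the general technique using `Promise.race`; `StateTimeoutError` and `withTimeout` are hypothetical names.

```typescript
class StateTimeoutError extends Error {
  readonly state: string;
  readonly timeoutMs: number;

  constructor(state: string, timeoutMs: number) {
    super(`State ${state} exceeded ${timeoutMs}ms`);
    this.name = 'StateTimeoutError';
    this.state = state;
    this.timeoutMs = timeoutMs;
  }
}

// Resolves with the handler's result, or rejects with StateTimeoutError
// if the handler has not settled within timeoutMs.
async function withTimeout<T>(state: string, timeoutMs: number, work: Promise<T>): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new StateTimeoutError(state, timeoutMs)), timeoutMs);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so a finished handler does not keep the process alive.
    clearTimeout(timer);
  }
}
```

Note that racing does not cancel the underlying work; a handler that must stop on timeout would also need an `AbortSignal` or similar.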

State Persistence

// src/onboarding/workflows/state-persistence.ts
export class StatePersistence {
  // The handlers are needed to rebuild the machine when resuming
  constructor(private handlers: StateHandlers) {}

  async saveState(onboardingId: string, state: OnboardingState, context: OnboardingContext) {
    // Upsert: insert a new row, or overwrite the existing row for this onboarding
    // (requires a unique constraint on onboarding_id)
    await db('onboarding_state')
      .insert({
        onboarding_id: onboardingId,
        current_state: state,
        context: JSON.stringify(context),
        updated_at: new Date(),
      })
      .onConflict('onboarding_id')
      .merge();
  }

  async loadState(onboardingId: string): Promise<{ state: OnboardingState; context: OnboardingContext }> {
    const record = await db('onboarding_state')
      .where('onboarding_id', onboardingId)
      .first();

    if (!record) {
      throw new Error(`Onboarding ${onboardingId} not found`);
    }

    return {
      state: record.current_state,
      context: JSON.parse(record.context),
    };
  }

  async resumeOnboarding(onboardingId: string): Promise<OnboardingResult> {
    const { state, context } = await this.loadState(onboardingId);

    // Resume from saved state (assumes OnboardingStateMachine exposes setState())
    const machine = new OnboardingStateMachine(context, this.handlers);
    machine.setState(state);

    return await machine.run();
  }
}

Monitoring & Observability

Metrics

// Record metrics at each state transition
// activeCount: number of onboardings currently in currentState
metrics.gauge('onboarding.active_count', activeCount, { state: currentState });
metrics.histogram('onboarding.state_duration_ms', durationMs, { state: currentState });
metrics.counter('onboarding.transitions', { from: prevState, to: currentState });

Logging

logger.info('State transition', {
  onboardingId,
  from: prevState,
  to: currentState,
  duration: durationMs,
  correlationId: context.correlationId,
  tenantId: context.tenantId,
});

Alerting

// Alert on prolonged states
if (currentState === 'DISCOVERING' && durationMs > 60000) {
  alerts.send({
    severity: 'warning',
    message: `Onboarding ${onboardingId} stuck in DISCOVERING for ${durationMs}ms`,
  });
}

// Alert on failures
if (currentState === 'FAILED') {
  alerts.send({
    severity: 'error',
    message: `Onboarding ${onboardingId} failed at state ${context.failedState}`,
    error: context.error,
  });
}
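
The prolonged-state check can be generalized from DISCOVERING to every non-final state by looking up the timeout documented for each one. A minimal sketch, where `STATE_TIMEOUT_MS` and `isStuck` are hypothetical names and the values are the timeouts from the engine configuration:

```typescript
// Timeouts in milliseconds, copied from the state machine configuration.
const STATE_TIMEOUT_MS: Record<string, number> = {
  RECEIVED: 1_000,
  VALIDATING: 10_000,
  DISCOVERING: 60_000,
  EXTRACTING: 30_000,
  GENERATING: 20_000,
  REGISTERING: 30_000,
  CONFIGURING_SYNC: 30_000,
  NOTIFYING: 10_000,
  ROLLING_BACK: 60_000,
};

// True when an onboarding has spent longer in `state` than its configured timeout.
// Final states have no entry and are never reported as stuck.
function isStuck(state: string, durationMs: number): boolean {
  const limit = STATE_TIMEOUT_MS[state];
  return limit !== undefined && durationMs > limit;
}
```

A periodic job could call `isStuck(currentState, durationMs)` for every active onboarding and raise the same `warning` alert as above.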

Next Steps

See 04-detection-algorithms.md for business unit detection logic.