Onboarding Workflow State Machine
State Machine Overview
The onboarding workflow is implemented as a finite state machine with clear transitions, rollback paths, and retry logic.
          START
            │
            ▼
┌─────────────────────────┐
│        RECEIVED         │
│  • Event captured       │
│  • Assigned correlation │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│       VALIDATING        │──────┐
│  • Tenant check         │      │
│  • BU detection         │      │ INVALID
│  • Duplicate check      │      │
└───────────┬─────────────┘      │
            │ VALID              │
            ▼                    ▼
┌─────────────────────────┐ ┌─────────┐
│       DISCOVERING       │ │ REJECTED│
│  • Find TFC workspace   │ └─────────┘
│  • Fetch repo metadata  │
│  • Download state       │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│       EXTRACTING        │
│  • Parse metadata       │──────┐
│  • Sanitize state       │      │
│  • Extract resources    │      │ FAILED
└───────────┬─────────────┘      │
            │ SUCCESS            │
            ▼                    │
┌─────────────────────────┐      │
│       GENERATING        │      │
│  • Create Domain entity │──────┤
│  • Create System entity │      │
│  • Create Components    │      │
└───────────┬─────────────┘      │
            │                    │
            ▼                    │
┌─────────────────────────┐      │
│       REGISTERING       │      │
│  • Insert to catalog    │──────┤
│  • Update relationships │      │
│  • Validate consistency │      │
└───────────┬─────────────┘      │
            │                    │
            ▼                    │
┌─────────────────────────┐      │
│    CONFIGURING_SYNC     │      │
│  • Setup TFC webhooks   │──────┤
│  • Schedule sync jobs   │      │
│  • Configure watchers   │      │
└───────────┬─────────────┘      │
            │                    │
            ▼                    │
┌─────────────────────────┐      │
│        NOTIFYING        │      │
│  • Send notifications   │      │
│  • Update status        │      │
└───────────┬─────────────┘      │
            │                    │
            ▼                    ▼
┌─────────────────────────┐ ┌─────────┐
│        COMPLETED        │ │ FAILED  │──┐
└─────────────────────────┘ └─────────┘  │
                                         │
                                         ▼
                                ┌──────────────────┐
                                │   ROLLING_BACK   │
                                │ • Delete entities│
                                │ • Clean up hooks │
                                └──────────────────┘
State Definitions
RECEIVED
Entry: Event received from trigger mechanism
Actions:
- Generate unique onboarding ID
- Assign correlation ID
- Persist event to database
- Acknowledge event source
Transitions:
- → VALIDATING (always)
Timeout: 1 second
VALIDATING
Purpose: Verify the event is valid and eligible for onboarding
Actions:
- Tenant Validation
  - Resolve tenant from GitHub org or TFC org
  - Verify tenant exists and is active
  - Check tenant has necessary permissions
- Business Unit Detection
  - Run detection algorithms (see 04-detection-algorithms.md)
  - Verify repository/workspace matches BU criteria
  - Extract BU identifier
- Duplicate Detection
  - Check if already onboarded (by repo URL)
  - Check if onboarding in progress
  - Verify not in failed state (unless force=true)
- Permission Check
  - Verify authenticated user can onboard for this tenant
  - Check rate limits not exceeded
Transitions:
- → DISCOVERING (if valid)
- → REJECTED (if invalid)
Timeout: 10 seconds
Retry: No retry (validation must pass on first attempt)
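The duplicate-detection rules listed above can be expressed as a small pure function. This is a minimal sketch, not the described implementation: the `OnboardingRecord` shape, the status names, the `checkDuplicate` function, and the reason strings are all assumptions made for illustration.

```typescript
// Hypothetical record shape; the real onboarding table may differ.
type OnboardingStatus = 'in_progress' | 'completed' | 'failed';

interface OnboardingRecord {
  repoUrl: string;
  status: OnboardingStatus;
}

type DuplicateCheck = { ok: true } | { ok: false; reason: string };

// Decide whether a new onboarding may start for repoUrl, given prior records.
function checkDuplicate(
  repoUrl: string,
  existing: OnboardingRecord[],
  force = false,
): DuplicateCheck {
  const prior = existing.filter(r => r.repoUrl === repoUrl);

  if (prior.some(r => r.status === 'completed')) {
    return { ok: false, reason: 'already onboarded' };
  }
  if (prior.some(r => r.status === 'in_progress')) {
    return { ok: false, reason: 'onboarding in progress' };
  }
  // A previous failure blocks a new attempt unless the caller forces a retry.
  if (!force && prior.some(r => r.status === 'failed')) {
    return { ok: false, reason: 'previous attempt failed (pass force=true to retry)' };
  }
  return { ok: true };
}
```

Returning a reason alongside the verdict lets the REJECTED state record why the event was turned away without a second lookup.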
DISCOVERING
Purpose: Gather all necessary data from external systems
Actions:
- GitHub Data
  const repoData = await github.repos.get({ owner, repo });
  const topics = await github.repos.getAllTopics({ owner, repo });
  const content = await github.repos.getContent({
    owner,
    repo,
    path: '.backstage/config.yaml'
  });
- Terraform Cloud Data
  const workspace = await tfc.getWorkspace(workspaceId);
  const currentState = await tfc.getCurrentState(workspaceId);
  const runs = await tfc.listRuns(workspaceId, { limit: 5 });
  const outputs = await tfc.getWorkspaceOutputs(workspaceId);
- GCP Data (optional)
  const folder = await gcp.cloudresourcemanager.folders.get({ name: folderId });
  const projects = await gcp.cloudresourcemanager.projects.list({ parent: folderId });
Transitions:
- → EXTRACTING (if all data retrieved)
- → FAILED (if critical data missing)
Timeout: 60 seconds
Retry: Yes, exponential backoff (3 attempts)
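The retry policy for this state (3 attempts with exponential backoff) could look roughly like the helper below. It is a sketch under stated assumptions: the 500 ms base delay, the doubling factor, and the names `backoffDelays` and `withRetry` are illustrative and do not come from the workflow itself.

```typescript
// Waits between attempts: baseMs, 2*baseMs, 4*baseMs, ...
// `attempts` counts total tries, so there are attempts - 1 waits.
function backoffDelays(attempts: number, baseMs: number): number[] {
  const waits = Math.max(attempts - 1, 0);
  return Array.from({ length: waits }, (_, i) => baseMs * 2 ** i);
}

// Run fn up to `attempts` times, sleeping between failures.
// `sleep` is injectable so tests do not have to wait in real time.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseMs = 500, // assumed base delay
  sleep: (ms: number) => Promise<void> = ms =>
    new Promise<void>(resolve => setTimeout(resolve, ms)),
): Promise<T> {
  const delays = backoffDelays(attempts, baseMs);
  let lastError: unknown = new Error('withRetry called with attempts < 1');
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const delay = delays[i];
      // No sleep after the final attempt; the error is rethrown instead.
      if (delay !== undefined) await sleep(delay);
    }
  }
  throw lastError;
}
```

A discovery call would then be wrapped as `await withRetry(() => tfc.getWorkspace(workspaceId))`, with the 60-second state timeout still bounding the total time spent.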
EXTRACTING
Purpose: Parse and transform raw data into structured metadata
Actions:
- Metadata Extraction
  const metadata = {
    businessUnit: extractBusinessUnit(repoName, workspace.tags),
    owner: extractOwner(repoData, backstageConfig),
    system: extractSystem(workspace.name),
    environment: extractEnvironment(workspace.name),
    gcpFolder: extractFolderId(terraformState),
    gcpProjects: extractProjects(terraformState),
    customMetadata: backstageConfig?.metadata || {},
  };
- State Sanitization
  const sanitized = sanitizeState(terraformState, {
    removeSecrets: true,
    includeOutputs: true,
    includeResources: ['google_folder', 'google_project', 'google_service'],
  });
- Resource Mapping
  const resources = terraformState.resources
    .filter(r => r.type.startsWith('google_'))
    .map(r => ({
      type: r.type,
      name: r.name,
      // Optional chaining on instances guards resources that have none
      id: r.instances?.[0]?.attributes?.id,
      metadata: extractResourceMetadata(r),
    }));
Transitions:
- → GENERATING (if extraction successful)
- → FAILED (if required metadata missing)
Timeout: 30 seconds
Retry: Yes (3 attempts with exponential backoff)
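The section calls `sanitizeState` but does not show what it does. One possible shape is sketched below, assuming the Terraform state layout of `resources[].instances[].attributes` plus `outputs` entries carrying a `sensitive` flag. The type names, the `pick` helper, and the secret-name pattern are assumptions for illustration only.

```typescript
interface TfInstance { attributes?: Record<string, unknown> }
interface TfResource { type: string; name: string; instances?: TfInstance[] }
interface TfOutput { value: unknown; sensitive?: boolean }
interface TfState { resources: TfResource[]; outputs?: Record<string, TfOutput> }

interface SanitizeOptions {
  removeSecrets: boolean;
  includeOutputs: boolean;
  includeResources: string[];
}

// Assumed heuristic for attribute names that may hold secrets.
const SECRET_KEY = /secret|password|private_key|token/i;

// Copy an object, keeping only the entries the predicate accepts.
function pick<V>(
  obj: Record<string, V>,
  keep: (key: string, value: V) => boolean,
): Record<string, V> {
  const out: Record<string, V> = {};
  for (const [key, value] of Object.entries(obj)) {
    if (keep(key, value)) out[key] = value;
  }
  return out;
}

function sanitizeState(state: TfState, opts: SanitizeOptions): TfState {
  const resources = state.resources
    // Keep only the allow-listed resource types.
    .filter(r => opts.includeResources.includes(r.type))
    .map(r => ({
      type: r.type,
      name: r.name,
      instances: (r.instances ?? []).map(inst => ({
        attributes: pick(
          inst.attributes ?? {},
          key => !(opts.removeSecrets && SECRET_KEY.test(key)),
        ),
      })),
    }));

  if (!opts.includeOutputs) return { resources };

  // Drop outputs that Terraform itself marked as sensitive.
  const outputs = pick(
    state.outputs ?? {},
    (_key, out) => !(opts.removeSecrets && out.sensitive === true),
  );
  return { resources, outputs };
}
```

Name-based filtering is only a heuristic; a stricter variant could allow-list attribute names per resource type instead of trying to recognise secrets.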
GENERATING
Purpose: Create Backstage entity YAML definitions
Actions:
- Domain Entity
  apiVersion: backstage.io/v1alpha1
  kind: Domain
  metadata:
    name: acme-finance
    namespace: tenant-123
    annotations:
      backstage.io/managed-by-location: url:https://github.com/acme-corp/bu-finance-infrastructure
      cloud.google.com/folder-id: "folders/123456"
    labels:
      tenant: tenant-123
      business-unit: finance
  spec:
    owner: group:finance-leadership
- System Entity
  apiVersion: backstage.io/v1alpha1
  kind: System
  metadata:
    name: acme-finance-infrastructure
    namespace: tenant-123
    annotations:
      backstage.io/managed-by-location: url:https://github.com/acme-corp/bu-finance-infrastructure
      terraform.io/workspace: bu-finance-infrastructure
      cloud.google.com/folder-id: "folders/123456"
  spec:
    owner: group:finance-engineering
    domain: acme-finance
- Component Entities (for each GCP project)
  apiVersion: backstage.io/v1alpha1
  kind: Component
  metadata:
    name: finance-seed-project
    namespace: tenant-123
    annotations:
      backstage.io/managed-by-location: url:https://github.com/acme-corp/bu-finance-infrastructure
      cloud.google.com/project-id: finance-seed-prod-a1b2
  spec:
    type: gcp-project
    lifecycle: production
    owner: group:finance-engineering
    system: acme-finance-infrastructure
    providesApis: []
    dependsOn: []
Transitions:
- → REGISTERING (if generation successful)
- → FAILED (if entity validation fails)
Timeout: 20 seconds
Retry: Yes (3 attempts)
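To illustrate how a Component entity like the one above might be generated and validated before registration, here is a hedged sketch. The `ProjectInfo` and `OnboardingInfo` inputs and both function names are hypothetical; the name rule follows the Backstage descriptor format (at most 63 characters, alphanumeric runs separated by `-`, `_` or `.`).

```typescript
interface ComponentEntity {
  apiVersion: string;
  kind: 'Component';
  metadata: { name: string; namespace: string; annotations: Record<string, string> };
  spec: { type: string; lifecycle: string; owner: string; system: string };
}

// Hypothetical inputs gathered during EXTRACTING.
interface ProjectInfo { name: string; projectId: string; lifecycle: string }
interface OnboardingInfo { tenantId: string; repoUrl: string; owner: string; system: string }

// Entity names: 1 to 63 chars, runs of [a-zA-Z0-9] separated by a single -, _ or .
function isValidEntityName(name: string): boolean {
  return name.length <= 63 && /^[a-zA-Z0-9]+([-_.][a-zA-Z0-9]+)*$/.test(name);
}

function buildComponent(project: ProjectInfo, info: OnboardingInfo): ComponentEntity {
  // Failing here corresponds to the "entity validation fails" transition.
  if (!isValidEntityName(project.name)) {
    throw new Error(`Invalid entity name: ${project.name}`);
  }
  return {
    apiVersion: 'backstage.io/v1alpha1',
    kind: 'Component',
    metadata: {
      name: project.name,
      namespace: info.tenantId,
      annotations: {
        'backstage.io/managed-by-location': `url:${info.repoUrl}`,
        'cloud.google.com/project-id': project.projectId,
      },
    },
    spec: {
      type: 'gcp-project',
      lifecycle: project.lifecycle,
      owner: info.owner,
      system: info.system,
    },
  };
}
```

Building entities as typed objects and serialising them to YAML afterwards keeps validation in one place, though the real generator may well work from templates instead.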
REGISTERING
Purpose: Insert entities into Backstage catalog database
Actions:
- Transactional Insert
  await db.transaction(async (trx) => {
    // Insert domain. returning('id') is required on PostgreSQL to get the new
    // row ID; the { id } row shape assumes Knex 1.x or later.
    const [{ id: domainId }] = await trx('entities').insert({
      api_version: 'backstage.io/v1alpha1',
      kind: 'Domain',
      metadata: JSON.stringify(domainEntity.metadata),
      spec: JSON.stringify(domainEntity.spec),
      tenant_id: tenantId,
    }).returning('id');
    // Insert system
    const [{ id: systemId }] = await trx('entities').insert({
      api_version: 'backstage.io/v1alpha1',
      kind: 'System',
      metadata: JSON.stringify(systemEntity.metadata),
      spec: JSON.stringify(systemEntity.spec),
      tenant_id: tenantId,
    }).returning('id');
    // Insert components, collecting the generated IDs for the relationship rows
    const componentIds = [];
    for (const component of componentEntities) {
      const [{ id }] = await trx('entities').insert({
        api_version: 'backstage.io/v1alpha1',
        kind: 'Component',
        metadata: JSON.stringify(component.metadata),
        spec: JSON.stringify(component.spec),
        tenant_id: tenantId,
      }).returning('id');
      componentIds.push(id);
    }
    // Update relationships
    await trx('entity_relationships').insert([
      { source_id: systemId, target_id: domainId, type: 'partOf' },
      ...componentIds.map(id => ({
        source_id: id,
        target_id: systemId,
        type: 'partOf',
      })),
    ]);
  });
- Validation
  // Verify entities are queryable
  const entities = await catalog.getEntities({
    filter: {
      'metadata.namespace': tenantId,
      'metadata.annotations.backstage.io/managed-by-location': repoUrl,
    },
  });
  if (entities.length !== expectedCount) {
    throw new Error('Entity count mismatch after registration');
  }
Transitions:
- → CONFIGURING_SYNC (if registration successful)
- → FAILED (if database insert fails)
Timeout: 30 seconds
Retry: Yes, with exponential backoff (5 attempts)
Rollback: Delete all inserted entities on failure
CONFIGURING_SYNC
Purpose: Set up ongoing synchronization mechanisms
Actions:
- TFC Webhook
await tfc.createNotification({
workspaceId,
name: 'backstage-sync',
destinationType: 'generic',
url: `https://backstage.example.com/api/sync/tfc-webhook`,
token: await this.generateWebhookToken(workspaceId),
triggers: ['run:completed'],
enabled: true,
});
- Scheduled Sync Job
await scheduler.schedule({
name: `sync-${workspaceId}`,
schedule: '0 */6 * * *', // Every 6 hours
job: {
type: 'terraform-state-sync',
workspaceId,
tenantId,
entities: entityIds,
},
});
- GitHub Watcher
await github.repos.createWebhook({
owner,
repo,
config: {
url: `https://backstage.example.com/api/sync/github-webhook`,
content_type: 'json',
secret: await this.generateWebhookSecret(repoUrl),
},
events: ['push'],
});
Transitions:
- → NOTIFYING (if sync setup successful)
- → COMPLETED (if notifications disabled)
- → FAILED (if webhook creation fails)
Timeout: 30 seconds
Retry: Yes (3 attempts)
Rollback: Delete webhooks on failure
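The webhook `secret` configured above only helps if the receiving endpoint checks it. GitHub signs each delivery body with HMAC-SHA256 and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex>`. The sketch below shows one way the `/api/sync/github-webhook` handler might verify that header; the function names are assumptions, and the handler itself is not described in this section.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Compute the signature GitHub would send for this body and secret.
function signPayload(secret: string, body: string): string {
  return 'sha256=' + createHmac('sha256', secret).update(body).digest('hex');
}

// Compare the received header against the expected signature in constant time.
function verifyGithubSignature(
  secret: string,
  rawBody: string,
  signatureHeader: string | undefined,
): boolean {
  if (!signatureHeader) return false;
  const expected = Buffer.from(signPayload(secret, rawBody));
  const received = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The check has to run against the raw request body, before any JSON parsing, because re-serialising the payload can change its bytes and invalidate the signature.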
NOTIFYING
Purpose: Notify stakeholders of successful onboarding
Actions:
- Slack Notification
await slack.postMessage({
channel: '#backstage-onboarding',
text: `✅ Business Unit onboarded successfully!`,
blocks: [
{
type: 'section',
text: {
type: 'mrkdwn',
text: `*Business Unit:* ${metadata.businessUnit}\n` +
`*Owner:* ${metadata.owner}\n` +
`*Repository:* <${repoUrl}|${repoName}>\n` +
`*Workspace:* ${workspaceName}`,
},
},
{
type: 'actions',
elements: [
{
type: 'button',
text: { type: 'plain_text', text: 'View in Backstage' },
url: `https://backstage.example.com/catalog/${tenantId}/${systemEntity.metadata.name}`,
},
],
},
],
});
- Email Notification
await email.send({
to: metadata.owner,
subject: `Backstage: ${metadata.businessUnit} onboarded`,
template: 'onboarding-success',
data: {
businessUnit: metadata.businessUnit,
backstageUrl: `https://backstage.example.com/catalog/${tenantId}/${systemEntity.metadata.name}`,
repoUrl,
workspaceName,
entitiesCreated: entities.length,
},
});
Transitions:
- → COMPLETED (always)
Timeout: 10 seconds
Retry: No (notification failures don't fail onboarding)
COMPLETED
Final State: Onboarding successfully completed
Actions:
- Update onboarding record status to 'completed'
- Record completion timestamp
- Generate audit log entry
Persistence:
await db('onboarding_history').insert({
id: onboardingId,
tenant_id: tenantId,
repo_url: repoUrl,
workspace_id: workspaceId,
status: 'completed',
entities_created: entityIds,
duration_ms: Date.now() - startTime,
completed_at: new Date(),
});
FAILED
Final State: Onboarding failed
Actions:
- Record failure reason
- Trigger rollback if necessary
- Send failure notification
- Generate detailed error report
Persistence:
await db('onboarding_history').insert({
id: onboardingId,
tenant_id: tenantId,
repo_url: repoUrl,
status: 'failed',
error: error.message,
error_state: currentState,
stack_trace: error.stack,
failed_at: new Date(),
});
REJECTED
Final State: Event rejected during validation
Actions:
- Record rejection reason
- No rollback needed (no entities created)
Common Reasons:
- Not a business unit repository
- Duplicate onboarding attempt
- Tenant not authorized
- Invalid metadata
ROLLING_BACK
Cleanup State: Undo partial onboarding
Actions:
- Delete Entities
  await db.transaction(async (trx) => {
    // Delete in reverse order (relationships first)
    await trx('entity_relationships')
      .whereIn('source_id', entityIds)
      .orWhereIn('target_id', entityIds)
      .delete();
    await trx('entities')
      .whereIn('id', entityIds)
      .delete();
  });
- Remove Webhooks
  await tfc.deleteNotification(workspaceId, 'backstage-sync');
  // Single options object, matching the other github.repos calls in this document
  await github.repos.deleteWebhook({ owner, repo, hook_id: webhookId });
- Cancel Scheduled Jobs
  await scheduler.cancel(`sync-${workspaceId}`);
Transitions:
- → FAILED (after cleanup)
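Which cleanup actions are needed depends on how far the workflow got before failing. The sketch below derives a cleanup plan from the state in which the failure occurred. The step names, the `planRollback` function, and the choice to undo the most recent side effects first are assumptions for illustration; the section itself does not prescribe an order.

```typescript
type CleanupStep = 'cancel_jobs' | 'remove_webhooks' | 'delete_entities';

// Forward order of the non-terminal states, as drawn in the overview diagram.
const STATE_ORDER = [
  'RECEIVED',
  'VALIDATING',
  'DISCOVERING',
  'EXTRACTING',
  'GENERATING',
  'REGISTERING',
  'CONFIGURING_SYNC',
  'NOTIFYING',
] as const;

type WorkflowState = (typeof STATE_ORDER)[number];

// Return the cleanup steps needed when the workflow fails in `failedState`.
function planRollback(failedState: WorkflowState): CleanupStep[] {
  const reached = STATE_ORDER.indexOf(failedState);
  const steps: CleanupStep[] = [];

  // Sync artifacts are created last, so they are undone first (assumed order).
  if (reached >= STATE_ORDER.indexOf('CONFIGURING_SYNC')) {
    steps.push('cancel_jobs', 'remove_webhooks');
  }
  // Catalog rows may exist from REGISTERING onward.
  if (reached >= STATE_ORDER.indexOf('REGISTERING')) {
    steps.push('delete_entities');
  }
  return steps;
}
```

For a failure in EXTRACTING the plan is empty, which matches the idea that nothing has been written yet and the workflow can move straight to FAILED.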
Implementation
State Machine Engine
// src/onboarding/workflows/state-machine.ts
import { StateMachine } from 'state-machine-lib';
export class OnboardingStateMachine {
private machine: StateMachine<OnboardingState, OnboardingEvent>;
constructor(
private context: OnboardingContext,
private handlers: StateHandlers
) {
this.machine = new StateMachine({
initial: 'RECEIVED',
states: {
RECEIVED: {
on: { VALIDATE: 'VALIDATING' },
timeout: 1000,
handler: this.handlers.handleReceived,
},
VALIDATING: {
on: {
VALID: 'DISCOVERING',
INVALID: 'REJECTED',
},
timeout: 10000,
handler: this.handlers.handleValidating,
},
DISCOVERING: {
on: {
DISCOVERED: 'EXTRACTING',
DISCOVERY_FAILED: 'FAILED',
},
timeout: 60000,
retry: { attempts: 3, backoff: 'exponential' },
handler: this.handlers.handleDiscovering,
},
EXTRACTING: {
on: {
EXTRACTED: 'GENERATING',
EXTRACTION_FAILED: 'FAILED',
},
timeout: 30000,
retry: { attempts: 3, backoff: 'exponential' },
handler: this.handlers.handleExtracting,
},
GENERATING: {
on: {
GENERATED: 'REGISTERING',
GENERATION_FAILED: 'FAILED',
},
timeout: 20000,
retry: { attempts: 3, backoff: 'exponential' },
handler: this.handlers.handleGenerating,
},
REGISTERING: {
on: {
REGISTERED: 'CONFIGURING_SYNC',
REGISTRATION_FAILED: 'ROLLING_BACK',
},
timeout: 30000,
retry: { attempts: 5, backoff: 'exponential' },
handler: this.handlers.handleRegistering,
},
CONFIGURING_SYNC: {
on: {
SYNC_CONFIGURED: 'NOTIFYING',
SYNC_FAILED: 'COMPLETED', // Non-critical, continue
},
timeout: 30000,
retry: { attempts: 3, backoff: 'exponential' },
handler: this.handlers.handleConfigureSync,
},
NOTIFYING: {
on: { NOTIFIED: 'COMPLETED' },
timeout: 10000,
handler: this.handlers.handleNotifying,
},
COMPLETED: {
type: 'final',
handler: this.handlers.handleCompleted,
},
FAILED: {
type: 'final',
handler: this.handlers.handleFailed,
},
REJECTED: {
type: 'final',
handler: this.handlers.handleRejected,
},
ROLLING_BACK: {
on: { ROLLED_BACK: 'FAILED' },
timeout: 60000,
handler: this.handlers.handleRollback,
},
},
});
}
async run(): Promise<OnboardingResult> {
try {
await this.machine.start(this.context);
return this.machine.getResult();
} catch (error) {
console.error('State machine execution failed:', error);
throw error;
}
}
}
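Since `state-machine-lib` is not a library this document describes further, it may help to see the same transition table as plain data with a lookup function. This is a minimal sketch that mirrors the `on` maps in the configuration above; the `nextState` and `runEvents` names are assumptions, and timeouts, retries, and handlers are deliberately left out.

```typescript
type State =
  | 'RECEIVED' | 'VALIDATING' | 'DISCOVERING' | 'EXTRACTING' | 'GENERATING'
  | 'REGISTERING' | 'CONFIGURING_SYNC' | 'NOTIFYING'
  | 'COMPLETED' | 'FAILED' | 'REJECTED' | 'ROLLING_BACK';

type Event =
  | 'VALIDATE' | 'VALID' | 'INVALID'
  | 'DISCOVERED' | 'DISCOVERY_FAILED'
  | 'EXTRACTED' | 'EXTRACTION_FAILED'
  | 'GENERATED' | 'GENERATION_FAILED'
  | 'REGISTERED' | 'REGISTRATION_FAILED'
  | 'SYNC_CONFIGURED' | 'SYNC_FAILED'
  | 'NOTIFIED' | 'ROLLED_BACK';

// Same transitions as the `on` maps in the engine configuration.
const TRANSITIONS: Record<State, Partial<Record<Event, State>>> = {
  RECEIVED: { VALIDATE: 'VALIDATING' },
  VALIDATING: { VALID: 'DISCOVERING', INVALID: 'REJECTED' },
  DISCOVERING: { DISCOVERED: 'EXTRACTING', DISCOVERY_FAILED: 'FAILED' },
  EXTRACTING: { EXTRACTED: 'GENERATING', EXTRACTION_FAILED: 'FAILED' },
  GENERATING: { GENERATED: 'REGISTERING', GENERATION_FAILED: 'FAILED' },
  REGISTERING: { REGISTERED: 'CONFIGURING_SYNC', REGISTRATION_FAILED: 'ROLLING_BACK' },
  CONFIGURING_SYNC: { SYNC_CONFIGURED: 'NOTIFYING', SYNC_FAILED: 'COMPLETED' },
  NOTIFYING: { NOTIFIED: 'COMPLETED' },
  ROLLING_BACK: { ROLLED_BACK: 'FAILED' },
  // Final states accept no events.
  COMPLETED: {},
  FAILED: {},
  REJECTED: {},
};

// Look up the target state, rejecting events the current state does not accept.
function nextState(state: State, event: Event): State {
  const target = TRANSITIONS[state][event];
  if (!target) {
    throw new Error(`Invalid transition: ${event} in state ${state}`);
  }
  return target;
}

// Replay a sequence of events from the initial state.
function runEvents(events: Event[]): State {
  return events.reduce<State>((state, event) => nextState(state, event), 'RECEIVED');
}
```

Keeping the table as data makes it straightforward to unit-test every edge and to compare the configured transitions against the diagram.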
State Persistence
// src/onboarding/workflows/state-persistence.ts
export class StatePersistence {
  // Handlers are injected so resumeOnboarding can rebuild the state machine
  constructor(private handlers: StateHandlers) {}

  async saveState(onboardingId: string, state: OnboardingState, context: OnboardingContext) {
    // Insert, or update the existing row when onboarding_id already exists.
    // onConflict().merge() is used because Knex implements .upsert() only for some dialects.
    await db('onboarding_state')
      .insert({
        onboarding_id: onboardingId,
        current_state: state,
        context: JSON.stringify(context),
        updated_at: new Date(),
      })
      .onConflict('onboarding_id')
      .merge();
  }

  async loadState(onboardingId: string): Promise<{ state: OnboardingState; context: OnboardingContext }> {
    const record = await db('onboarding_state')
      .where('onboarding_id', onboardingId)
      .first();
    if (!record) {
      throw new Error(`Onboarding ${onboardingId} not found`);
    }
    return {
      state: record.current_state,
      context: JSON.parse(record.context),
    };
  }

  async resumeOnboarding(onboardingId: string): Promise<OnboardingResult> {
    const { state, context } = await this.loadState(onboardingId);
    // Resume from saved state (assumes OnboardingStateMachine exposes a setState method)
    const machine = new OnboardingStateMachine(context, this.handlers);
    machine.setState(state);
    return await machine.run();
  }
}
Monitoring & Observability
Metrics
// Record metrics at each state transition
metrics.gauge('onboarding.active_count', { state: currentState });
metrics.histogram('onboarding.state_duration_ms', durationMs, { state: currentState });
metrics.counter('onboarding.transitions', { from: prevState, to: currentState });
Logging
logger.info('State transition', {
onboardingId,
from: prevState,
to: currentState,
duration: durationMs,
correlationId: context.correlationId,
tenantId: context.tenantId,
});
Alerting
// Alert on prolonged states
if (currentState === 'DISCOVERING' && durationMs > 60000) {
alerts.send({
severity: 'warning',
message: `Onboarding ${onboardingId} stuck in DISCOVERING for ${durationMs}ms`,
});
}
// Alert on failures
if (currentState === 'FAILED') {
alerts.send({
severity: 'error',
message: `Onboarding ${onboardingId} failed at state ${context.failedState}`,
error: context.error,
});
}
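The prolonged-state alert above covers only DISCOVERING. A generalised check could reuse the per-state timeouts from the state definitions, as in this sketch. The `STATE_TIMEOUT_MS` table and the `isStuck` function are hypothetical names; the millisecond values are the timeouts listed earlier in this document.

```typescript
// Per-state timeouts in milliseconds, taken from the state definitions.
const STATE_TIMEOUT_MS: Record<string, number> = {
  RECEIVED: 1_000,
  VALIDATING: 10_000,
  DISCOVERING: 60_000,
  EXTRACTING: 30_000,
  GENERATING: 20_000,
  REGISTERING: 30_000,
  CONFIGURING_SYNC: 30_000,
  NOTIFYING: 10_000,
  ROLLING_BACK: 60_000,
};

// True when a state has run longer than its configured timeout.
// Final states have no timeout and are never reported as stuck.
function isStuck(state: string, durationMs: number): boolean {
  const limit = STATE_TIMEOUT_MS[state];
  return limit !== undefined && durationMs > limit;
}
```

With this in place the alert condition becomes `if (isStuck(currentState, durationMs))`, so new states only need a table entry to be covered.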
Next Steps
See 04-detection-algorithms.md for business unit detection logic.